
Add SyntheticDiD variance_method='bootstrap_refit' and coverage MC study#351

Merged
igerber merged 19 commits into main from sdid-bootstrap-refit on Apr 23, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 22, 2026

Summary

  • Ship paper-faithful variance_method="bootstrap_refit" for SyntheticDiD — Arkhangelsky et al. (2021) Algorithm 2 step 2. Re-estimates ω̂_b and λ̂_b via two-pass sparsified Frank-Wolfe on each pairs-bootstrap draw, using the fit-time normalized-scale zeta. Opt-in; default remains "placebo".
  • Cross-surface allow-list extensions so the new enum flows through all surfaces that read variance_method: results.py:960 summary line, business_report.py:602 BR inference-label, synthetic_did.py:695 n_bootstrap result population, power.py SDID guidance strings, SyntheticDiD.__init__ docstring, diff_diff/guides/llms-full.txt. Survey + refit raises NotImplementedError upstream in fit() — Rao-Wu rescaled-weight composition is tracked as a follow-up TODO.
  • Coverage Monte Carlo study (benchmarks/python/coverage_sdid.py) runs 500 seeds × B=200 × 3 DGPs × 4 methods under H0 and writes benchmarks/data/sdid_coverage.json. Rejection rates at α ∈ {0.01, 0.05, 0.10} and mean-SE / true-SD ratios are transcribed into REGISTRY.md §SyntheticDiD. Headline: refit achieves near-nominal calibration (rej@0.05 ≈ 0.04–0.08) across all 3 DGPs; fixed-weight over-rejects by ~1.8–3.2× on smaller panels; placebo is also near-nominal; jackknife is slightly anti-conservative on smaller panels (matches Arkhangelsky §6.3's reported 98% / 93% coverage pattern).
  • SyntheticDiD.set_params validation gap: the sklearn-style setter path bypassed constructor validation for the new enum. Extract a shared _validate_config helper; make set_params transactional so a validation failure rolls back touched attributes rather than leaving the instance partially mutated.

Methodology references (required if estimator / math changes)

  • Method name(s): Synthetic Difference-in-Differences bootstrap variance — new variance_method="bootstrap_refit" variant.
  • Paper / source link(s): Arkhangelsky, Athey, Hirshberg, Imbens, Wager (2021), "Synthetic Difference-in-Differences," AER 111(12). New variant implements Algorithm 2 step 2 verbatim (re-estimate ω̂_b and λ̂_b per draw via Frank-Wolfe). Cross-language anchor: Julia Synthdid.jl::src/vcov.jl:96-103 is the only existing refit-bootstrap implementation; R synthdid::vcov(method="bootstrap") and Stata sdid.ado:1033-1037 both use the fixed-weight shortcut. Full methodology surface + requirements checklist row in docs/methodology/REGISTRY.md §SyntheticDiD.
  • Any intentional deviations from the source (and why): Survey designs (including pweight-only) composed with bootstrap_refit raise NotImplementedError — Rao-Wu rescaled weights composed with FW re-estimation need their own derivation (the paper does not address survey designs; R has no survey support). Documented in REGISTRY.md and tracked in TODO.md. FW non-convergence warnings are aggregated into a single summary UserWarning at end-of-loop if the rate exceeds 5% (same threshold as retry-exhaustion), rather than emitting one per bootstrap draw; this preserves the warning signal without 200+ per-fit spam. No deviations from the paper's estimator math, SE formula (sqrt((r-1)/r) × sd(ddof=1) — unchanged), or p-value dispatch (analytical from bootstrap SE — unchanged from PR Fix SyntheticDiD bootstrap p-value dispatch and SE formula #349).
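The unchanged SE formula referenced above is compact enough to pin down in a sketch. This is a minimal stand-alone version with a hypothetical helper name, not the library's actual `_bootstrap_se` signature:

```python
import numpy as np

def bootstrap_se(estimates):
    """sqrt((r-1)/r) * sd(ddof=1) over r valid bootstrap replicates.

    Algebraically this equals the population-style sd(ddof=0); writing it
    via the sample SD mirrors the formula as the registry documents it.
    """
    est = np.asarray(estimates, dtype=float)
    r = est.size
    return np.sqrt((r - 1) / r) * est.std(ddof=1)

# The rescaling and the ddof=0 SD coincide exactly:
draws = np.array([1.0, 2.0, 4.0, 7.0])
assert np.isclose(bootstrap_se(draws), draws.std(ddof=0))
```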

Validation

  • Tests added/updated:
    • tests/test_methodology_sdid.py::TestBootstrapRefitSE — 8 real-fit tests (positive SE, diverges from fixed, tracks placebo on exchangeable DGP, raises on pweight + full-design survey, analytical p-value dispatch, enum validation, summary renders replications)
    • tests/test_methodology_sdid.py::TestPValueSemantics::test_refit_p_value_matches_analytical — mirror of the fixed-weight bootstrap regression guard
    • tests/test_methodology_sdid.py::TestGetSetParams — 4 new set_params validation tests (accept bootstrap_refit, reject invalid enum, reject incoherent n_bootstrap, allow n_bootstrap=1 under jackknife) + 1 transactional-rollback test
    • tests/test_methodology_sdid.py::TestCoverageMCArtifact — schema smoke test on the MC JSON (guarded with pytest.skip if absent per feedback_golden_file_pytest_skip.md)
    • tests/test_business_report.py::TestSyntheticDiDBootstrapRefitInferenceLabel — cross-surface guard that BR emits "bootstrap_refit variance" on alpha-override, not the analytical fallback label
    • PR Fix SyntheticDiD bootstrap p-value dispatch and SE formula #349's 1e-10 R-parity bit-identity test (TestJackknifeSERParity::test_bootstrap_se_matches_r) still passes — fixed-weight path byte-identical to baseline despite the _bootstrap_se signature expansion
  • Backtest / simulation / notebook evidence (if applicable): Coverage MC artifact committed at benchmarks/data/sdid_coverage.json (500 seeds × B=200 × 3 DGPs × 4 methods, generated on M-series Mac + Rust backend in ~40 min). Headline calibration table rendered in REGISTRY.md §SyntheticDiD:
| DGP | method | α=0.01 | α=0.05 | α=0.10 | mean SE / true SD |
|-----|--------|--------|--------|--------|-------------------|
| balanced | placebo | 0.016 | 0.060 | 0.086 | 1.05 |
| balanced | bootstrap | 0.106 | 0.160 | 0.230 | 0.85 |
| balanced | bootstrap_refit | 0.028 | 0.078 | 0.116 | 1.05 |
| balanced | jackknife | 0.066 | 0.112 | 0.154 | 1.08 |
| unbalanced | placebo | 0.006 | 0.032 | 0.070 | 1.08 |
| unbalanced | bootstrap | 0.036 | 0.098 | 0.140 | 0.94 |
| unbalanced | bootstrap_refit | 0.008 | 0.038 | 0.080 | 1.11 |
| unbalanced | jackknife | 0.024 | 0.076 | 0.120 | 0.99 |
| AER §6.3 | placebo | 0.018 | 0.058 | 0.086 | 0.99 |
| AER §6.3 | bootstrap | 0.034 | 0.092 | 0.162 | 0.88 |
| AER §6.3 | bootstrap_refit | 0.010 | 0.040 | 0.078 | 1.05 |
| AER §6.3 | jackknife | 0.030 | 0.080 | 0.150 | 0.90 |
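The table's cells reduce to two per-(DGP, method) statistics: the fraction of H0 p-values below each α, and mean SE over the empirical SD of the point estimates across seeds. A minimal sketch of that transcription step, with hypothetical names (the repo's `coverage_sdid.py` internals are not shown here):

```python
import numpy as np

def calibration_row(p_values, ses, point_estimates, alphas=(0.01, 0.05, 0.10)):
    """One (DGP, method) row of the coverage table under H0.

    p_values, ses, point_estimates are per-seed arrays from the MC loop;
    "true SD" is the empirical SD of the point estimates across seeds.
    """
    p = np.asarray(p_values, dtype=float)
    rejection = {a: float(np.mean(p < a)) for a in alphas}
    se_over_sd = float(np.mean(ses) / np.std(point_estimates, ddof=1))
    return rejection, se_over_sd
```

Under well-calibrated inference the rejection rates sit near their α and the SE/SD ratio near 1.0, which is the pattern the refit and placebo rows show above.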

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 3 commits April 22, 2026 11:56
Implements Arkhangelsky et al. (2021) Algorithm 2 step 2 as an opt-in
variance method that re-estimates ω̂_b and λ̂_b via two-pass sparsified
Frank-Wolfe on each pairs-bootstrap draw, using the fit-time normalized-
scale zeta. Default remains "placebo".
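The loop this commit describes can be sketched as follows. The three solver callables are hypothetical stand-ins for the library's Frank-Wolfe helpers, injected so the skeleton stays self-contained; the real implementation lives in `_bootstrap_se`:

```python
import numpy as np

def bootstrap_refit_se(Y, n_treated, n_bootstrap, zeta, rng,
                       solve_unit_weights, solve_time_weights, sdid_estimate):
    """Pairs (unit-level) bootstrap re-estimating both weight vectors per
    draw — the Algorithm 2 step 2 shape described in the commit message.

    Y is units x periods with treated units in the last n_treated rows;
    zeta is the fit-time normalized regularization scale, reused per draw.
    """
    n = Y.shape[0]
    draws = []
    while len(draws) < n_bootstrap:
        idx = rng.integers(0, n, size=n)            # resample units w/ replacement
        n_treated_drawn = int(np.sum(idx >= n - n_treated))
        if not 0 < n_treated_drawn < n:
            continue                                 # need both groups present
        Yb = Y[idx]
        omega_b = solve_unit_weights(Yb, zeta)       # re-run Frank-Wolfe per draw
        lambda_b = solve_time_weights(Yb)
        draws.append(sdid_estimate(Yb, omega_b, lambda_b))
    r = len(draws)
    return np.sqrt((r - 1) / r) * np.std(draws, ddof=1)
```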

Cross-surface allow-list extensions land in one PR per
feedback_cross_surface_parity_audit.md:
- SyntheticDiD.fit() dispatcher and _bootstrap_se signature
- synthetic_did.py:695 n_bootstrap result population
- results.py:960 summary() "Bootstrap replications" gating
- business_report.py:602 inference-label allow-list
- power.py SDID guidance strings (2 sites)
- SyntheticDiD.__init__ docstring and diff_diff/guides/llms-full.txt

Survey + bootstrap_refit raises NotImplementedError upstream in fit()
(covers both pweight-only and full-design) — the Rao-Wu rescaled-weight
composition is tracked as a follow-up TODO.

Coverage MC study (benchmarks/python/coverage_sdid.py) runs 500 seeds ×
B=200 × 3 DGPs × 4 methods under H0 and writes
benchmarks/data/sdid_coverage.json (4.4 KB). Rejection rates at α ∈
{0.01, 0.05, 0.10} and mean SE / true SD ratios are transcribed into
REGISTRY.md §SyntheticDiD. Headline: refit achieves near-nominal
calibration across all 3 DGPs; fixed-weight over-rejects by roughly
1.8–3.2× on smaller panels, consistent with the SE under-estimate from
ignoring weight-estimation uncertainty.

Tests: TestBootstrapRefitSE (8 tests) + test_refit_p_value_matches_analytical
in TestPValueSemantics + TestCoverageMCArtifact schema smoke test
(guarded with pytest.skip per feedback_golden_file_pytest_skip.md) +
cross-surface BR inference-label test. PR #349's 1e-10 R-parity
bit-identity gate still passes.

Per-draw Frank-Wolfe non-convergence UserWarnings are suppressed inside
the refit loop and aggregated into a single summary warning at end-of-
loop if the rate exceeds 5% — the same threshold the retry-exhaustion
guard uses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AI review caught that the sklearn-style setter path bypassed the
constructor's enum/coherence checks, so users could
``set_params(variance_method='not_a_method')`` after construction and
slip past the __init__ validation added for ``bootstrap_refit``.

Extract the existing checks into a private ``_validate_config()`` helper
and call from both ``__init__`` and ``set_params`` so both paths enforce
the same contract. Constant-hoist the valid-methods tuple onto the class
as ``_VALID_VARIANCE_METHODS`` so __init__ and the validator share a
single source.
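The shared-validator shape described above, as an illustrative sketch (names mirror the commit; the real class has many more parameters, and the transactional rollback added in a later commit is omitted here):

```python
class SyntheticDiDSketch:
    """Illustrative only — shows the shared _validate_config contract,
    not the library's actual class."""

    _VALID_VARIANCE_METHODS = ("placebo", "bootstrap", "jackknife")

    def __init__(self, variance_method="placebo", n_bootstrap=200):
        self.variance_method = variance_method
        self.n_bootstrap = n_bootstrap
        self._validate_config()

    def _validate_config(self):
        if self.variance_method not in self._VALID_VARIANCE_METHODS:
            raise ValueError(f"unknown variance_method: {self.variance_method!r}")
        if self.variance_method != "jackknife" and self.n_bootstrap < 2:
            raise ValueError("n_bootstrap must be >= 2 unless jackknife")

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        self._validate_config()   # same contract as __init__
        return self
```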

Add regression tests under ``TestGetSetParams``:
- set_params accepts ``bootstrap_refit``
- set_params rejects unknown variance_method (parity with __init__)
- set_params rejects incoherent n_bootstrap < 2 when method != jackknife
- set_params allows n_bootstrap=1 under jackknife (deterministic)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the review's P2 finding: if ``_validate_config`` rejects the
post-update state in a multi-attribute ``set_params`` call, the instance
was left with partially-applied (invalid) values after the raised
``ValueError``. Snapshot originals before any setattr and restore them
in an except handler so the raise leaves the object consistent with its
pre-call configuration.

Regression test asserts post-raise state matches the pre-call state.
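The snapshot-and-restore pattern this commit describes can be sketched as a standalone helper (hypothetical name and free-function form; the real logic lives inside the estimator's `set_params`):

```python
import copy

def transactional_set_params(obj, validate, **params):
    """Apply params, validate, and roll back every touched attribute if
    validation raises — so the raise leaves obj in its pre-call state."""
    snapshot = {name: copy.deepcopy(getattr(obj, name)) for name in params}
    try:
        for name, value in params.items():
            setattr(obj, name, value)
        validate(obj)
    except Exception:
        for name, value in snapshot.items():
            setattr(obj, name, value)   # restore pre-call configuration
        raise
    return obj
```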

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

✅ Looks good

Executive Summary

  • No unmitigated P0/P1 issues found. The new bootstrap_refit path re-estimates both ω and λ per draw and keeps the documented SE / p-value semantics aligned with the Methodology Registry.
  • The bootstrap_refit + survey-design exclusion is explicitly documented in the registry and tracked in TODO.md, so it is informational rather than blocking.
  • P3: the aggregated Frank-Wolfe non-convergence warning is counted on raw solver warnings rather than draw-level failures, so it can fire earlier than the registry text implies.
  • P3: a few in-code docstrings still describe the pre-PR variance-method surface.
  • P3: the existing scale-equivariance regression suite was not extended to bootstrap_refit, so the new normalization-sensitive path lacks a direct guard.

Methodology

No unmitigated findings. The refit path in diff_diff/synthetic_did.py:L583 and diff_diff/synthetic_did.py:L1071 re-estimates both weights per draw, and the non-placebo inference dispatch in diff_diff/synthetic_did.py:L642 matches the registry notes in docs/methodology/REGISTRY.md:L1505 and docs/methodology/REGISTRY.md:L1555.

  • Severity P3. Impact: bootstrap_refit still rejects any survey design, but that deviation is explicitly documented in docs/methodology/REGISTRY.md:L1513 and tracked in TODO.md:L103, so it is mitigated and does not block approval. Concrete fix: none for this PR; the follow-up Rao-Wu + refit derivation is already recorded.

Code Quality

  • Severity P3. Impact: the refit non-convergence summary warning counts raw solver warnings in diff_diff/synthetic_did.py:L1092 and compares them to successful draws in diff_diff/synthetic_did.py:L1173, while the registry text says the warning should trigger when the non-convergence rate exceeds 5% of draws in docs/methodology/REGISTRY.md:L1513. That can emit the warning below the documented draw-level threshold when both omega and lambda warn on the same draw. Concrete fix: count “any non-convergence on this draw” as a boolean, or normalize against 2 * n_successful if the intended unit is solver calls.
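The review's concrete fix — count "any non-convergence on this draw" as a boolean — can be sketched as follows. `solve_draw` is a hypothetical callable that may emit up to three solver UserWarnings per draw; the real loop lives in `_bootstrap_se`:

```python
import warnings

def run_refit_draws(draws, solve_draw, warn_rate_threshold=0.05):
    """Aggregate Frank-Wolfe non-convergence per draw, not per solver call."""
    flagged = 0
    results = []
    for draw in draws:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            results.append(solve_draw(draw))
        # Boolean per draw: one warning or three both count exactly once,
        # so the rate matches the registry's per-draw convention.
        if any(issubclass(w.category, UserWarning) for w in caught):
            flagged += 1
    if draws and flagged / len(draws) > warn_rate_threshold:
        warnings.warn(
            f"Frank-Wolfe failed to converge on {flagged}/{len(draws)} draws",
            UserWarning,
        )
    return results
```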

Performance

No findings. The extra cost of bootstrap_refit is opt-in and documented in diff_diff/synthetic_did.py:L56 and docs/methodology/REGISTRY.md:L1513.

Maintainability

No unmitigated findings.

Tech Debt

  • Severity P3. Impact: placebo_effects now carries placebo draws, fixed-bootstrap draws, refit-bootstrap draws, and jackknife LOO values, so the name is increasingly misleading after this PR. This is already tracked in TODO.md:L129. Concrete fix: none required for approval; follow the tracked rename to a neutral name such as variance_effects.

Security

No findings.

Documentation/Tests

Execution note: I could not run the test suite in this sandbox because pytest and numpy are not installed here.

…ression

Three P3 items from the CI AI review, all under Documentation/Tests and
Code Quality (no methodology change):

1. **FW warning counting**: the refit loop counted raw solver warnings
   (up to 3 per draw: ω pre-sparsify, ω main, λ main), not draws. REGISTRY
   text describes the rate per valid bootstrap draw, so the two could
   diverge by up to 3×. Change counts "any non-convergence on this draw"
   as boolean 1 and tighten the REGISTRY wording to call out the per-draw
   convention explicitly (each draw runs Frank-Wolfe once for ω and once
   for λ; either can trip the draw).

2. **Docstring gaps**: `fit()` now documents the new survey + refit
   NotImplementedError under the `survey_design` parameter and in the
   Raises block. `SyntheticDiDResults.variance_method` and
   `SyntheticDiDResults.placebo_effects` fields now list `bootstrap_refit`
   and describe the four-way semantics of `placebo_effects` under each
   variance method.

3. **Scale-equivariance regression**: add `"bootstrap_refit"` to the
   `variance_method` parametrizations in `TestScaleEquivariance`
   (baseline-parity, scale-equivariance across ~15 orders of magnitude,
   and extreme-scale-detection). Capture a bit-identity baseline for the
   refit path on `_make_panel(seed=42)`, so the normalization-sensitive
   branch has its own drift guard, not inferred from sibling methods.

All 37 SDID targeted tests + 361 broader sweep tests pass; fixed-weight
1e-10 R-parity bit-identity gate preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6a033267682aa63852637ca5206c4eab06808c80


Overall Assessment

⚠️ Needs changes

Executive Summary

  • [Newly identified] The PR now codifies the mapping “bootstrap = R-compatible fixed-weight shortcut, bootstrap_refit = paper-faithful refit” across the registry, public docstrings, changelog, and new regression tests, but the official synthdid source currently accessible does not support that blanket “matching R” claim. It stores update.omega / update.lambda in attr(estimate, "opts"), passes those opts back through bootstrap_sample(), and notes the supplied bootstrap weights are used only for initialization. (raw.githubusercontent.com)
  • Previous non-blocking findings from the last review look addressed: the BusinessReport allow-list now includes bootstrap_refit, SyntheticDiDResults.summary() shows bootstrap replications for refit, and the scale-equivariance suite now covers the new variance path.
  • The bootstrap_refit survey-design exclusion is documented in the methodology registry and tracked in TODO.md, so it is informational rather than blocking.
  • I could not execute the test suite in this sandbox because pytest, numpy, and pandas are not installed.

Methodology

  • Severity P1 [Newly identified]. Impact: The changed method description in docs/methodology/REGISTRY.md:L1497-L1513, diff_diff/synthetic_did.py:L54-L59, diff_diff/synthetic_did.py:L912-L936, diff_diff/results.py:L813-L819, and CHANGELOG.md:L11-L12 says the existing variance_method="bootstrap" is the R-compatible fixed-weight bootstrap and positions bootstrap_refit as the new paper-faithful variant. The official synthdid source currently accessible contradicts that as written: synthdid_estimate() persists update.omega / update.lambda in attr(estimate, "opts"), bootstrap_sample() feeds those opts back into the estimator, and the package source says the supplied weights are used only for initialization in bootstrap/placebo SEs. That makes the current fixed-weight path an undocumented deviation from the cited reference implementation, and the new fixed-vs-refit regression in tests/test_methodology_sdid.py:L568-L596 now hard-codes that mismatch into the test surface. Concrete fix: either (1) document the current bootstrap path in REGISTRY.md as a **Note (deviation from R):** and remove the “matching R” / “R-compatible” language everywhere, or (2) retarget variance_method="bootstrap" to the official synthdid bootstrap semantics and rename the fixed-weight shortcut; in both cases, update the new regression/parity language to match. (raw.githubusercontent.com)
  • Severity P3. Impact: bootstrap_refit still rejects any survey design, but that limitation is explicitly documented in docs/methodology/REGISTRY.md:L1513 and tracked in TODO.md:L103. Concrete fix: none required for approval.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3. Impact: placebo_effects now spans placebo draws, fixed-bootstrap draws, refit-bootstrap draws, and jackknife LOO values; the name is misleading but already tracked in TODO.md:L129. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

No separate unmitigated findings beyond the methodology-labeling issue above. The prior doc/test gaps from the last review appear addressed in the changed files.

Path to Approval

  1. Resolve the public method mapping for SDID bootstrap variance: either document the current fixed-weight bootstrap path as a REGISTRY.md deviation from official synthdid, or change bootstrap to match the official source and rename the fixed-weight shortcut.
  2. Remove or rewrite the new “matching R” / “R-compatible” statements in docs/methodology/REGISTRY.md:L1497-L1513, diff_diff/synthetic_did.py:L54-L59, diff_diff/synthetic_did.py:L912-L936, diff_diff/results.py:L813-L819, and CHANGELOG.md:L11-L12.
  3. Update the new fixed-vs-refit regression and any parity messaging so the tests enforce the approved semantics instead of encoding the current unsupported “R-compatible fixed-weight bootstrap” claim.

… deviation

Tracing R's source (vcov.R::bootstrap_sample and synthdid.R) shows that
R's default synthdid::vcov(method="bootstrap") rebinds
attr(estimate, "opts") — which includes update.omega=TRUE from the
original fit — back into synthdid_estimate inside its do.call, so the
renormalized ω is used only as Frank-Wolfe initialization and ω and λ
are re-estimated per draw. R's default bootstrap is refit, not fixed-
weight. The sum_normalize helper in R's source explicitly comments
that the supplied weights "are used only for initialization" in
bootstrap and placebo SEs.

Our variance_method="bootstrap" holds the renormalized ω exactly (no
FW re-run). It is therefore a deliberate deviation from R's default.
Our PR #349 fixture generator at benchmarks/R/... is a manual
fixed-weight invocation — it omits the opts rebind, which defaults
update.omega to FALSE given non-null weights. The 1e-10 parity test
anchors our fixed-weight path to that manual R invocation, not to R's
real vcov behavior.

Documentation-only fix across all claim sites; no methodology or code
behavior changes:

- REGISTRY.md §SyntheticDiD: label the fixed-weight bootstrap as
  "Alternative: Bootstrap at unit level — fixed-weight shortcut";
  add explicit **Note (deviation from R)** citing the vcov.R /
  synthdid.R opts-rebind mechanism; call out bootstrap_refit as
  matching R's default vcov. Requirements checklist entries and
  R-parity test scope Note rewritten to match.
- diff_diff/synthetic_did.py: __init__ docstring and _bootstrap_se
  method docstring drop the "matching R" framing on the fixed-weight
  path; bootstrap_refit is flagged as matching R's default.
- diff_diff/results.py: SyntheticDiDResults.variance_method field doc
  fixed (I introduced the "R-compatible fixed-weight shortcut"
  misphrasing in round 1; it was wrong).
- CHANGELOG.md Unreleased/Added: Bundle A entry clarifies that
  bootstrap_refit matches R's default and the existing fixed-weight
  bootstrap is now explicitly documented as a deviation.
- benchmarks/R/generate_sdid_bootstrap_parity_fixture.R: loop comment
  calls out the non-default invocation shape (no opts rebind → runs
  fixed-weight); references the Python test that consumes this fixture.
- tests/test_methodology_sdid.py::test_bootstrap_se_matches_r
  docstring: rewritten to scope the parity check correctly (manual
  R fixed-weight, not R's default vcov).
- TODO.md: add a new row for the refit cross-language parity anchor
  (Julia Synthdid.jl or R via the real vcov path) to make the missing
  anchor explicit.

All 57 targeted tests pass; no methodology change, no numerical
output change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 3fa6d8d91ca8a4472d953b775af5f280a8e225d8


Overall Assessment

✅ Looks good

Executive Summary

  • The previous P1 around SDID bootstrap/R semantics looks materially resolved: fixed variance_method="bootstrap" is now documented as a deliberate deviation from R default, and the paper-faithful refit path is split out cleanly.
  • I did not find a new P0/P1 in the changed estimator path. The refit branch re-estimates both weight sets per draw, unsupported survey composition is rejected up front, and inference still routes through the existing NaN-safe path.
  • The new deferred items are properly tracked in TODO.md, so they are informational rather than blocking.
  • Residual issues are P3-only: a small amount of wording still overstates exact R-default parity or uses the old “R-compatible shortcut” framing.
  • I could not execute pytest here because the sandbox does not have it installed.

Methodology
Previous P1 about fixed bootstrap being mislabeled as R-default-faithful appears addressed.

  • Severity P3. Impact: the public wording still slightly over-claims exact R-default parity for bootstrap_refit. The docs say or imply that the new path matches R’s default bootstrap, but the implementation re-runs the library’s standard fresh-start two-pass Frank-Wolfe helpers rather than seeding unit-weight optimization with the renormalized original ω described in the registry’s own R-trace note. That is a numerical implementation choice, not a paper-methodology defect, so this is informational only, especially because the missing direct refit parity anchor is already tracked. Concrete fix: either weaken the wording to “same refit methodology as R default” or thread the R-style initialization into the refit branch and close the tracked parity follow-up. Refs: docs/methodology/REGISTRY.md:L1505, diff_diff/synthetic_did.py:L61, diff_diff/synthetic_did.py:L1098, diff_diff/utils.py:L1573, TODO.md:L104

Code Quality
No findings.

Performance
No findings.

Maintainability

No findings.
Tech Debt

  • Severity P3. Impact: the two real follow-ups introduced by this PR are properly documented and tracked, so they do not block approval: survey-design composition for refit and a direct cross-language refit parity anchor. Concrete fix: none required in this PR; keep the follow-ups open. Refs: TODO.md:L103, TODO.md:L104

Security
No findings.

Documentation/Tests
No unmitigated findings. Verification note: I could not run the new tests in this environment because pytest is not installed.

igerber and others added 3 commits April 22, 2026 16:07
variance_method="bootstrap" now means refit (Arkhangelsky et al. 2021
Algorithm 2 step 2; also R's default synthdid::vcov(method="bootstrap")
behavior, which rebinds attr(estimate, "opts") with update.omega=TRUE so
the renormalized ω serves only as Frank-Wolfe initialization). The
previously-shipped fixed-weight shortcut is removed entirely; the
"bootstrap_refit" enum value briefly added in earlier commits of this
PR is folded back into "bootstrap".

Why this is a correctness fix, not just a relabel: the old fixed-weight
"bootstrap" matched neither the paper (which prescribes refit) nor R's
default vcov (also refit). The 1e-10 R-parity test from PR #349 anchored
fixed-weight Python against a manual R invocation that omitted the opts
rebind — both sides were wrong in the same direction. Coverage MC at
benchmarks/data/sdid_coverage.json (500 seeds × B=200) confirms the
new "bootstrap" tracks placebo near-nominal across the three
representative DGPs; the old fixed-weight column over-rejected at α=0.05
at rates 0.16 / 0.098 / 0.092 (1.8-3.2× nominal).

Capability regression: SDID + survey designs (pweight-only AND
strata/PSU/FPC) now raises NotImplementedError. The removed fixed-weight
bootstrap was the only SDID variance method that supported
strata/PSU/FPC (via the Rao-Wu rescaled bootstrap branch inside
_bootstrap_se). Pweight-only users can switch to variance_method="placebo"
or "jackknife"; strata/PSU/FPC users have no SDID variance option on
this release. Rao-Wu rescaled weights composed with paper-faithful
Frank-Wolfe re-estimation needs a weighted-FW derivation; sketch and
reusable scaffolding pointers live in REGISTRY.md §SyntheticDiD's
"Note (deferred survey + bootstrap composition)" and TODO.md. The
deleted Rao-Wu code (≈48 lines of _bootstrap_se) is recoverable via
`git show <THIS_COMMIT>^:diff_diff/synthetic_did.py` near the pre-rewrite
_bootstrap_se body.
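The support matrix this commit describes (pweight-only works with placebo/jackknife; refit bootstrap rejects all survey designs; strata/PSU/FPC has no SDID variance option) can be sketched as an up-front guard. Hypothetical standalone helper — the real checks live inside `fit()`:

```python
def check_survey_support(variance_method, survey_design):
    """Raise early for unsupported survey + variance-method combinations,
    mirroring the capability matrix described in the commit message."""
    if survey_design is None:
        return
    if variance_method == "bootstrap":
        raise NotImplementedError(
            "variance_method='bootstrap' (refit) does not support survey "
            "designs; Rao-Wu rescaled weights composed with Frank-Wolfe "
            "re-estimation is tracked as a follow-up."
        )
    has_structure = any(
        getattr(survey_design, attr, None) is not None
        for attr in ("strata", "psu", "fpc")
    )
    if has_structure:
        raise NotImplementedError(
            "strata/PSU/FPC designs have no SDID variance option in this "
            "release; pweight-only works with placebo or jackknife."
        )
```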

Cross-surface allow-list reverts: the additive "bootstrap_refit" enum
shipped in earlier commits of this PR rippled through results.py:960
summary gating, business_report.py:602 inference-label allow-list,
power.py SDID guidance strings, llms-full.txt enums, and SyntheticDiDResults
field docstrings. All of those are now back to a 3-value surface
("bootstrap", "jackknife", "placebo").

Tests:
- TestBootstrapRefitSE class deleted; 4 unique tests folded into
  TestBootstrapSE (tracks-placebo-exchangeable, raises-pweight-survey,
  raises-full-design-survey, summary-shows-replications).
- test_bootstrap_se_matches_r deleted along with its fixture
  (tests/data/sdid_bootstrap_indices_r.json) and generator
  (benchmarks/R/generate_sdid_bootstrap_parity_fixture.R) — they
  anchored the now-removed fixed-weight path.
- TestPValueSemantics::test_refit_p_value_matches_analytical deleted as
  duplicate of test_bootstrap_p_value_matches_analytical.
- TestScaleEquivariance._BASELINE: "bootstrap" row updated to the
  refit values (4.6033, 0.21424970..., 2.10890881e-102, 200) — bit-
  identical to the captured "bootstrap_refit" baseline since the new
  bootstrap path is the same code as the old refit path. Tolerance
  tightened from rel=1e-8 to rel=1e-14 to enforce bit-identity.
- TestGetSetParams: variance_method literals rebound to "bootstrap";
  test_set_params_accepts_bootstrap_refit deleted (redundant with
  constructor tests).
- TestCoverageMCArtifact: expected methods list set exact-equal to
  ("placebo", "bootstrap", "jackknife").
- test_business_report.py inference-label test class + method renamed
  to drop "refit" suffix; assertion checks for "bootstrap variance".

The benchmarks/data/sdid_coverage.json artifact is updated transitionally
in this commit (fixed-weight column dropped; refit column renamed to
bootstrap) so the schema test stays green; a follow-up commit
regenerates from a fresh 500-seed MC re-run with the new code path.
The REGISTRY coverage table cells are TBD pending that re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Doc-only follow-up to the previous commit's bootstrap rewrite. Updates
every user-facing surface that referenced the (now-removed) fixed-weight
bootstrap or the additive bootstrap_refit option:

- docs/choosing_estimator.rst: drops the "Via bootstrap" cell from the
  SDID survey-support row (no SDID variance method supports
  strata/PSU/FPC anymore); rewrites the misdirecting note steering users
  to bootstrap for full survey designs; updates the inference summary
  table description for SDID's variance methods.
- docs/survey-roadmap.md: rewrites the SDID limitations table rows to
  reflect the regression matrix (pweight-only works with placebo /
  jackknife; strata/PSU/FPC has no SDID variance option in this release;
  bootstrap rejects all survey designs).
- docs/performance-scenarios.md: updates the SE-comparison scenario's
  timing expectation note (bootstrap is now ~10-100x slower per fit
  than the previous fixed-weight shortcut).
- docs/tutorials/03_synthetic_did.ipynb: rewrites markdown cells 19
  (inference methods description) and 29 (summary) — bootstrap is now
  paper-faithful refit matching R's default vcov, not the prior
  fixed-weight shortcut.
- docs/tutorials/18_geo_experiments.ipynb: rewrites the bootstrap-vs-
  placebo description (cell t18-cell-028); softens the stakeholder
  narrative claim "the two methods agree" to acknowledge that on small
  panels with non-exchangeable factor structure the SE magnitudes can
  differ while both methods still agree on significance and CI direction
  (cell t18-cell-033); re-executes the comparison cell so the output
  reflects the new bootstrap SE = 4.50 (was 4.26 under fixed-weight).
  The drift-guard asserts at cell t18-cell-026 only pin ATT / conf_int /
  pre-fit RMSE — none of which change — so no guard updates needed.
- diff_diff/synthetic_did.py: fit() docstring's survey_design parameter
  description is now consistent with the actual guards (no bootstrap_refit
  references; explicit pweight-only-on-placebo-or-jackknife matrix).
- benchmarks/python/coverage_sdid.py: --help text drops the
  bootstrap_refit mention.
- METHODOLOGY_REVIEW.md: the v3.x SyntheticDiD review entry's claim that
  bootstrap matches R's bootstrap_sample is replaced with an honest
  description of the corrected refit semantics, plus a parenthetical
  historical note about the prior fixed-weight shortcut.

All 351 targeted tests pass; no methodology or numerical change in this
commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fresh 500 seeds × B=200 run of benchmarks/python/coverage_sdid.py with
the new 3-method ALL_METHODS = ("placebo", "bootstrap", "jackknife").
Total wall-clock ~40 min on M-series Mac, Rust backend.

Numbers match the transitional rename in commit 1 exactly (rej@0.05:
balanced=0.078, unbalanced=0.038, aer63=0.040) — expected bit-identity
since the new "bootstrap" path is the same refit code as the previous
"bootstrap_refit" path, and the MC uses identical seeds (range(500)).
Confirms the rewrite didn't introduce numerical drift.

Headline across the three DGPs:
- bootstrap (refit): near-nominal at α=0.05 (0.078 / 0.038 / 0.040)
  versus nominal 0.05; well within 2σ MC band at 500 seeds.
- placebo: also near-nominal (0.060 / 0.032 / 0.058).
- jackknife: mildly anti-conservative on the smaller panels (0.112 on
  balanced, 0.080 on AER §6.3), matching the paper's §6.3 pattern.

REGISTRY.md §SyntheticDiD coverage-MC table cells replace the _TBD_
placeholders with the transcribed numbers. The narrative under the
table (`Reading:` paragraph) was already rewritten in commit 1 for the
3-method surface; no further edit needed.

The transitional commit-1 artifact (where I renamed bootstrap_refit →
bootstrap in-place without running the MC) carried a `transitional_note`
field in the metadata; the fresh run naturally drops that field since
it's written from scratch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 15481d9a859f1da8fa6f935d3ea5b3d6c8edbca3


Overall Assessment

✅ Looks good

Executive Summary

  • The previous substantive methodology concern is resolved: variance_method="bootstrap" now re-estimates both ω and λ per draw, keeps the documented retry-to-B contract, and still routes inference through safe_inference().
  • I did not find a new unmitigated P0/P1 in the changed estimator path.
  • The survey-support regression is explicit in code and documentation and is properly tracked in TODO.md, so it is informational rather than blocking.
  • Remaining issues are P3-only: one stale replicate-weight fallback message, and wording that still overstates verified parity with R’s default bootstrap.
  • I could not run pytest here because pytest is not installed in this environment.

Methodology

  • Severity P3. Impact: the refit bootstrap implementation fresh-starts each Frank-Wolfe solve via compute_sdid_unit_weights(...) / compute_time_weights(...), but the changed docs still state that the path matches R’s default vcov(method="bootstrap"), whose own rationale in the registry depends on the renormalized ω being threaded in as initialization. That is an implementation-choice / verification issue, not a demonstrated numerical defect, and it is already tracked in TODO.md, so it should not block approval. Concrete fix: either soften the wording to “paper-faithful refit bootstrap, intended to match R default” or add the TODO-tracked cross-language parity harness. Refs: diff_diff/synthetic_did.py:L924-L935, docs/methodology/REGISTRY.md:L1506-L1508, METHODOLOGY_REVIEW.md:L504-L508, CHANGELOG.md:L15-L18, TODO.md:L104-L104
  • No other findings. On the changed estimator path, the code now matches the updated registry on the load-bearing methodology points: it re-estimates both weight sets per bootstrap draw, retries degenerate/non-finite draws until n_bootstrap valid replicates or the bounded budget, preserves the sqrt((r-1)/r) * sd(ddof=1) SE formula, and routes bootstrap/jackknife inference through safe_inference() rather than inline t/p/CI code. Refs: diff_diff/synthetic_did.py:L583-L650, diff_diff/synthetic_did.py:L823-L1024, docs/methodology/REGISTRY.md:L1497-L1552
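
The SE aggregation named in that finding can be sketched as (illustrative helper name, not the library's internal function):

```python
import numpy as np

def bootstrap_se(att_replicates: np.ndarray) -> float:
    """SE of r bootstrap ATT replicates via sqrt((r - 1) / r) * sd(ddof=1).
    Algebraically this equals the population sd (ddof=0): the correction
    factor cancels the Bessel adjustment inside the sample sd."""
    r = att_replicates.size
    return float(np.sqrt((r - 1) / r) * np.std(att_replicates, ddof=1))
```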

Code Quality

  • No findings.

Performance

  • No findings. The documented slowdown is the expected cost of correcting the methodology from fixed-weight bootstrap to per-draw refitting, not an accidental regression. Refs: CHANGELOG.md:L15-L21

Maintainability

  • Severity P3. Impact: the replicate-weight guard still tells users to “Use a TSL-based survey design (strata/psu/fpc),” but this PR simultaneously removes SDID support for strata/PSU/FPC on all variance methods. Users hitting that path are now sent to an unsupported alternative. Concrete fix: update the exception text to reflect the new contract, e.g. “replicate weights are not supported; pweight-only works with placebo / jackknife, and full survey designs have no SDID variance path in this release.” Refs: diff_diff/synthetic_did.py:L286-L290, diff_diff/synthetic_did.py:L299-L330, docs/methodology/REGISTRY.md:L1549-L1550

Tech Debt

  • Severity P3. Impact: the two real follow-ups introduced by this change are properly tracked, so they do not block approval: SDID+survey bootstrap composition and a direct cross-language refit parity anchor. Concrete fix: none required in this PR; keep the TODOs open. Refs: TODO.md:L103-L104

Security

  • No findings.

Documentation/Tests

CI review on commit 15481d9 flagged the docs as overclaiming parity with
R's default synthdid::vcov(method="bootstrap"): R warm-starts Frank-Wolfe
from the renormalized fit-time ω per draw (and keeps fit-time λ as FW
init for the λ re-estimation), while our Python port was cold-starting
from uniform. On the strictly-convex FW objective with simplex constraint,
warm- and cold-start converge to the same global minimum given enough
iterations — but the 100-iter pre-sparsify pass may not fully converge
on some draws, and then sparsification is path-dependent on the init.

Port the warm-start shape:

- diff_diff/utils.py: compute_sdid_unit_weights and compute_time_weights
  gain an init_weights=None kwarg, forwarded to _sc_weight_fw for the
  first pass. When None (default), preserves the Rust top-level fast-path
  unchanged. When provided, falls through to the Python two-pass
  dispatcher; inner FW calls still dispatch to Rust via _sc_weight_fw, so
  the perf cost is one Python call per pass per draw.
- diff_diff/synthetic_did.py::_bootstrap_se: thread warm-start per draw.
  boot_omega_init = _sum_normalize(unit_weights[boot_control_idx]) (same
  shape as R's sum_normalize(weights$omega[sort(ind[ind <= N0])])).
  boot_lambda_init = time_weights (fit-time λ unchanged, matching R's
  weights.boot$lambda = weights$lambda).
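
A minimal sketch of the per-draw warm-start threading described above (the `_sum_normalize` body here is an assumption written to mirror R's `sum_normalize`; the real helper may differ):

```python
import numpy as np

def _sum_normalize(w: np.ndarray) -> np.ndarray:
    # Renormalize nonnegative weights to sum to 1; fall back to uniform
    # when everything is zero (assumed behavior, mirroring R).
    s = w.sum()
    return w / s if s > 0 else np.full(w.shape, 1.0 / w.size)

def warm_starts(unit_weights, time_weights, boot_control_idx):
    """Per-draw Frank-Wolfe inits: renormalized fit-time omega restricted
    to the resampled controls, and the fit-time lambda unchanged."""
    boot_omega_init = _sum_normalize(np.asarray(unit_weights)[boot_control_idx])
    boot_lambda_init = np.asarray(time_weights)
    return boot_omega_init, boot_lambda_init
```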

Baseline drift is tiny (rel ≈ 1e-4, not literally ULP-level):
TestScaleEquivariance._BASELINE["bootstrap"] SE shifts from
0.21424970247101688 (cold-start) to 0.21427381053829253 (warm-start), as
a handful of the 200 bootstrap draws converge to a different
sparsification pattern than the cold-start pass. rel=1e-14 bit-identity
on the warm-start capture holds.

Coverage MC: NOT re-run. User course-correction ("not sure we need to
run the full MC with every pass here"). Math argument: strictly-convex
FW → unique global minimum; aggregated rejection rates shift well under
2σ MC noise (≈0.02–0.05 per cell at B=200 × 500 seeds). The committed
benchmarks/data/sdid_coverage.json (generated under cold-start in commit
15481d9) stays accurate for the warm-start path within MC noise.
REGISTRY.md coverage-MC Note now documents the artifact regeneration
cadence explicitly: regenerate on methodology changes that materially
shift per-draw numerics, not on refactors that preserve the FW global
optimum (warm-start, backend migration, renames, docstring fixes).
Per-seed bit-identity on the captured fixture is the cheaper, stricter
parity check.

Other changes in this commit:
- diff_diff/synthetic_did.py replicate-weight guard message rewritten
  (R2 P3 maintainability): the old message told users to switch to a
  TSL-based survey design with strata/psu/fpc, which is no longer
  supported by any SDID variance method in this release.
- docs/tutorials/18_geo_experiments.ipynb re-executed; bootstrap SE
  displayed in the comparison cell is now 4.44 (warm-start) vs 4.50
  (cold-start). Narrative unchanged.
- CHANGELOG.md and METHODOLOGY_REVIEW.md describe the warm-start shape
  and drop the "cross-language bit parity" implication from earlier
  wording.
- REGISTRY.md bootstrap section describes the warm-start mechanism
  including the FW strict-convexity argument for why warm and cold
  converge to the same answer but warm matches R's procedure at the
  algorithm level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 105cf73a28c6e8aae698945be5cea1020fb79e5b


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The refit SDID bootstrap implementation itself looks methodologically aligned with the updated registry: it re-estimates both ω and λ per draw, keeps the sqrt((r-1)/r) SE aggregation, and still routes bootstrap/jackknife inference through safe_inference().
  • The prior user-facing survey-message problem appears resolved: the current guard text no longer points users toward unsupported full-design SDID survey paths.
  • P1 [Newly identified]: the survey-bootstrap removal was not propagated to the legacy SDID survey tests. The new guards in diff_diff/synthetic_did.py:L311-L337 now reject exactly the combinations that tests/test_survey_phase5.py:L179-L196, tests/test_survey_phase5.py:L226-L253, and tests/test_survey_phase5.py:L309-L322 still assert should succeed.
  • P2: several changed documentation surfaces still describe the old SDID bootstrap contract (fixed weights / Rao-Wu), so the project’s own methodology cross-references are internally inconsistent.
  • The deferred survey+refit composition work and the cross-language parity anchor are both properly tracked in TODO.md:L103-L104, so they are informational rather than blocking.
  • I could not run pytest here because this environment is missing runtime test dependencies (pytest, numpy).

Methodology

Code Quality

  • No findings.

Performance

  • No findings. The slowdown is the expected cost of replacing the old fixed-weight shortcut with per-draw Frank-Wolfe refitting, and that tradeoff is documented.

Maintainability

  • Severity P3. _bootstrap_se()’s docstring says unit_weights / time_weights “are not used inside the loop,” but the implementation immediately uses both as warm-start initializations. Impact: this gives maintainers the wrong mental model of what is fixed versus merely reused as initialization. Concrete fix: rewrite diff_diff/synthetic_did.py:L858-L862 to say the original weights are not reused as fixed estimator weights, only as Frank-Wolfe warm starts, matching diff_diff/synthetic_did.py:L873-L877 and diff_diff/synthetic_did.py:L925-L962.

Tech Debt

  • Severity P3 informational. The two real follow-ups introduced here are explicitly tracked: survey+refit composition and cross-language refit parity in TODO.md:L103-L104. Impact: none for approval. Concrete fix: none in this PR; keep those TODOs open.

Security

  • No findings.

Documentation/Tests

  • Severity P1 [Newly identified]. The new survey guards in diff_diff/synthetic_did.py:L311-L337 now make bootstrap + pweight-only survey and bootstrap + full survey design raise NotImplementedError, but the legacy SDID survey tests still assert the old success contract in tests/test_survey_phase5.py:L179-L196, tests/test_survey_phase5.py:L226-L253, and tests/test_survey_phase5.py:L309-L322. Impact: the branch leaves the SDID survey contract internally contradictory and will keep the test suite red once dependencies are present. Concrete fix: rewrite those tests to assert the new NotImplementedError behavior, and replace the old Rao-Wu comparison with positive pweight-only placebo / jackknife cases.
  • Severity P2. Several changed documentation surfaces still state the old SDID bootstrap story: docs/methodology/REGISTRY.md:L2699-L2699 still says Unit-level bootstrap (fixed weights), docs/methodology/REGISTRY.md:L2889-L2895 still lists SyntheticDiD under Rao-Wu survey bootstrap, docs/survey-roadmap.md:L52-L54 still lists SyntheticDiD among the Rao-Wu survey-bootstrap estimators, and diff_diff/guides/llms-full.txt:L1674-L1676 still says survey-aware bootstrap includes SyntheticDiD via Rao-Wu. Impact: reviewers and downstream doc consumers get conflicting methodology guidance from the project’s own source material. Concrete fix: update those cross-reference summaries to say SDID bootstrap is the paper-faithful refit path for non-survey fits only, and that SDID has no survey-bootstrap support in this release.

Path to Approval

  1. Update the stale SDID survey tests in tests/test_survey_phase5.py so they match the new public contract: variance_method="bootstrap" with any survey_design should raise NotImplementedError, while pweight-only positive coverage should move to placebo / jackknife.

Addresses CI review R3 findings on PR #351:

P1: rewrite three legacy SDID survey tests in tests/test_survey_phase5.py
that still asserted the old fixed-weight + Rao-Wu success contract so they
match the new upstream guards (bootstrap + any survey design and full
strata/PSU/FPC with any variance method now raise NotImplementedError).
test_full_design_bootstrap_smoke -> test_full_design_bootstrap_raises;
test_full_design_se_differs_from_weights_only is replaced by a positive
placebo pweight-only smoke test since the old Rao-Wu vs pweight-only SE
comparison no longer has two valid paths to compare;
test_bootstrap_with_survey -> test_bootstrap_with_pweight_only_raises
plus a new test_jackknife_with_pweight_only for positive coverage. Also
align the pre-existing placebo/jackknife guard tests with the updated
guard wording (same stale-regex class of issue the P1 flagged).

P2: remove SyntheticDiD from cross-reference Rao-Wu Rescaled Bootstrap
sections that survived the bootstrap rewrite. REGISTRY comparison table
row now describes the alternative as paper-faithful refit bootstrap (not
fixed weights); the Rao-Wu section lists only SunAbraham and TROP with an
explicit Note explaining why SDID is excluded (refit + Rao-Wu composition
requires weighted-FW derivation, tracked in TODO.md and sketched in
REGISTRY SDID survey-regression Note). survey-roadmap.md Phase 6 bullet
and llms-full.txt survey-aware bootstrap bullet updated with the same
framing.

P3: rewrite _bootstrap_se docstring (diff_diff/synthetic_did.py) to say
the fit-time unit_weights and time_weights are Frank-Wolfe warm-start
initializations on each draw, not unused parameters retained for
signature parity. The in-loop narrative comment at the warm-start site
was already correct; the top-of-function docstring was stale from the
pre-warm-start commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: f49a19f8cdb71d7cac7da51f37c5738c58b21543


Overall Assessment

✅ Looks good

No unmitigated P0 or P1 findings remain in the changed PR surface. The core SDID refit-bootstrap path is aligned with the updated methodology registry, and the prior re-review blocker around stale SDID survey tests appears resolved.

Executive Summary

  • The changed SyntheticDiD bootstrap path is methodologically aligned with the updated registry: it re-estimates ω̂_b and λ̂_b per draw, preserves sqrt((r-1)/r) SE aggregation, and still routes bootstrap/jackknife inference through safe_inference() for all-or-nothing NaN gating. See diff_diff/synthetic_did.py:L590-L657, diff_diff/synthetic_did.py:L830-L1057, diff_diff/utils.py:L174-L206, docs/methodology/REGISTRY.md:L1497-L1551.
  • The previous P1 around SDID survey tests is resolved. The changed tests now assert the new NotImplementedError contract for bootstrap and full-design survey paths. See tests/test_survey_phase5.py:L179-L216.
  • The cross-surface result/reporting surfaces look consistent in the current tree: bootstrap replications still render in summary(), and BusinessReport still labels SDID alpha overrides as bootstrap variance instead of falling through to analytical wording. See diff_diff/results.py:L963-L966, diff_diff/business_report.py:L593-L604, tests/test_business_report.py:L4658-L4702.
  • Remaining issues are minor P3 cleanup only: one deferred-work breadcrumb still contains placeholder commit IDs, one new survey test no longer checks the ATT-equivalence contract stated in its name/docstring, and one new coverage-artifact docstring still says "4 methods" although the artifact/test now use 3.
  • I could not run pytest here because this environment is missing numpy; verification was limited to static inspection plus compile() parsing of the changed Python files.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: the deferred survey-bootstrap follow-up is properly tracked, but the newly added breadcrumbs are not directly actionable because they still contain placeholder SHAs (<sdid-bootstrap-refit-removal-sha> and <removal-commit>). Concrete fix: replace those placeholders with the actual commit hash, or rewrite the note to point to a stable PR number / git log --grep lookup instead. See TODO.md:L103-L104, docs/methodology/REGISTRY.md:L1550-L1550.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: test_placebo_full_vs_pweight_only_att_match is named and documented as an ATT-equivalence regression check, but its body now only asserts that the pweight-only fit is finite. That means the claimed "stripping strata/PSU should not change ATT" contract is no longer actually tested. Concrete fix: either compare the pweight-only ATT against the same data with strata/PSU/FPC columns ignored/dropped, or rename the test/docstring to describe it as a smoke test. See tests/test_survey_phase5.py:L227-L252.
  • Severity P3. Impact: the new coverage-artifact test docstring still says the MC study covers "4 methods," while the artifact schema and assertions immediately below now enforce exactly 3 methods (placebo, bootstrap, jackknife). That leaves stale methodology wording in the new regression surface. Concrete fix: update the docstring to say "3 methods." See tests/test_methodology_sdid.py:L2865-L2907.

- Replace placeholder commit SHAs in TODO.md and REGISTRY.md deferred-
  work breadcrumbs with the actual fixed-weight-removal commit hash
  (91082e5 "Replace SDID fixed-weight bootstrap with paper-faithful
  refit"). The follow-up implementer can now directly run
  `git show 91082e5:diff_diff/synthetic_did.py` to recover the Rao-Wu
  scaffolding (rw_control/rw_treated split, degenerate-retry check,
  treated-mean weighting). Also cite PR #351 for stability against
  commit-hash churn.

- Rename tests/test_survey_phase5.py::
  test_placebo_full_vs_pweight_only_att_match to
  test_placebo_with_pweight_only_full_design_stripped_att_match and
  promote the body from a single-fit smoke to the ATT-equivalence check
  the docstring claims: fit with two equivalent
  for-bit (abs=1e-12). The intent is to confirm strata/psu/fpc columns
  sitting on the DataFrame are not tacitly read unless the SurveyDesign
  references them.

- Fix stale "3 DGPs × 4 methods" docstring in
  tests/test_methodology_sdid.py::TestCoverageMCArtifact to read
  "3 DGPs × 3 methods", matching the 3-method schema the test now
  enforces (placebo / bootstrap / jackknife; bootstrap_refit and the
  old fixed-weight 'bootstrap' enum are both gone).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ec92d39b9fbf6d16f5e7642bf534b278a6d35a83


Overall Assessment
✅ Looks good

No unmitigated P0 or P1 findings in the changed PR surface.

Executive Summary

Methodology

  • No unmitigated findings. The changed estimator and variance code is internally consistent with the methodology registry on Algorithm 2 refit bootstrap, analytical bootstrap/jackknife p-values, and NaN-safe inference. diff_diff/synthetic_did.py#L644 diff_diff/synthetic_did.py#L842 docs/methodology/REGISTRY.md#L1497
  • Severity P3. Impact: Support for SDID survey bootstrap and full-design surveys regresses in this release, but the deviation is explicitly documented and tracked, so it is informational only under the review policy. Concrete fix: no approval blocker; the weighted-Frank-Wolfe survey composition follow-up is already specified in TODO.md#L103.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: external parity for the new refit-bootstrap semantics is still deferred; confidence currently rests on same-library regression tests plus the committed MC artifact rather than a live R/Julia anchor. Concrete fix: implement the cross-language parity harness already tracked in TODO.md#L104.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: test_placebo_with_pweight_only_full_design_stripped_att_match compares two semantically identical pweight-only SurveyDesigns, so it cannot catch the regression its docstring claims to cover. A future silent pickup of stratum or psu columns would still pass this test. Concrete fix: compare the pweight-only fit against the same dataset after dropping or renaming stratum and psu, or against a copy where those columns are absent. tests/test_survey_phase5.py#L227

Verification note: runtime execution was not possible here because this environment is missing numpy and pytest; review was limited to static inspection and AST parsing.

The previous rewrite compared two semantically identical pweight-only
SurveyDesign constructions, which can't catch the contract the docstring
claimed — if a future change silently picked up `stratum` or `psu` by
name, both fits would pick them up identically and the test would still
pass.

Rewrite to compare a fit on the original DataFrame (with `stratum` /
`psu` columns present) against a fit on the same data with those columns
physically dropped. If the estimator ever silently reads those columns
by naming convention the two fits would diverge and the abs=1e-12 ATT
check would fail.
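
The drop-and-compare shape can be sketched generically (`fit_att` stands in for the real SDID fit; the helper name is hypothetical):

```python
import pandas as pd

def att_with_columns_stripped(df: pd.DataFrame, fit_att):
    """Run the same fit on the original frame and on a copy with the
    survey design columns physically dropped; a silent read of
    `stratum`/`psu` by naming convention would make the ATTs diverge."""
    stripped = df.drop(columns=["stratum", "psu"], errors="ignore")
    return fit_att(df), fit_att(stripped)

# usage in a test:
#   att_full, att_stripped = att_with_columns_stripped(df, fit_att)
#   assert abs(att_full - att_stripped) <= 1e-12
```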

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 826b1a8e722960a0936da30785fbdf5f4ed8cf97


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed PR surface. The remaining issues are P2/P3 only. Runtime verification was static-only because this environment does not have numpy, pandas, or pytest.

Executive Summary

Methodology

  • Severity P3. Impact: No unmitigated methodology defect. The affected method is SyntheticDiD bootstrap variance. The refit path in diff_diff/synthetic_did.py:L830-L1057 refits both ω_b and λ_b, keeps the sqrt((r-1)/r) aggregation, and still routes bootstrap/jackknife inference through safe_inference() in diff_diff/synthetic_did.py:L643-L657, matching the local registry at docs/methodology/REGISTRY.md:L1497-L1551. Upstream synthdid also documents bootstrap/jackknife/placebo as Algorithms 2/3/4, and its bootstrap source calls back into synthdid_estimate() while synthdid_estimate() defaults update.omega / update.lambda to refitting-from-initialization when weights are supplied. Concrete fix: none. (synth-inference.github.io)
  • Severity P3. Impact: The SDID survey-bootstrap/full-design regression is explicit and mitigated by documentation/tracking: any survey + variance_method="bootstrap" fit and any full-design SDID fit now raises NotImplementedError, and that deviation is both documented and tracked. Concrete fix: none for approval; the weighted-Frank-Wolfe survey-bootstrap follow-up is already queued. diff_diff/synthetic_did.py:L283-L336 docs/methodology/REGISTRY.md:L1549-L1550 TODO.md:L103-L104

Code Quality

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: External parity anchoring for the refit bootstrap remains deferred. The old fixed-weight R fixture is deleted, and the replacement Julia/R parity anchor is only tracked in TODO.md, so this is informational rather than blocking. Concrete fix: land the queued cross-language parity harness already listed in TODO.md:L104-L104.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: [Newly identified] test_bootstrap_p_value_null_calibration still documents and asserts the removed fixed-weight bootstrap behavior. Its docstring says the test is for the “fixed-weight regime” and its lower-bound assertion rejection_rate > 0.05 assumes systematic over-rejection, but the new registry and committed coverage artifact now describe near-nominal refit-bootstrap calibration. That leaves the test description wrong today and can make the assertion flaky or fail on correct code. Concrete fix: rewrite the test around its actual regression target, namely “not collapsed near 0 and not catastrophically large,” and update the docstring/comments to the refit-bootstrap semantics. tests/test_methodology_sdid.py:L2547-L2589 docs/methodology/REGISTRY.md:L1552-L1570
  • No other findings. The prior informational survey-test issue appears addressed in tests/test_survey_phase5.py:L227-L270.

Prior behavior: ``_bootstrap_se`` tallied Frank-Wolfe non-convergence via
``warnings.catch_warnings``, but the Rust FW entry point is silent on
``max_iter`` exhaustion (only the pure-NumPy path called
``warn_if_not_converged``). On the default Rust backend the aggregate
warning at the end of the bootstrap loop therefore never fired, even
when draws did not converge — a silent failure.

Fix: thread an explicit convergence bool out of the Rust solver.

Rust (``rust/src/weights.rs``, ``rust/src/lib.rs``)
- ``sc_weight_fw_gram`` / ``sc_weight_fw_standard`` now set and return
  ``converged = true`` on a min-decrease break, ``false`` otherwise.
- ``sc_weight_fw_internal`` returns ``(Array1<f64>, bool)``.
- Existing ``sc_weight_fw`` pyfunction destructures and drops the bool,
  preserving its ABI for the rank-selection heuristic in ``prep.py``
  and for any third-party consumer.
- New pyfunction ``sc_weight_fw_with_convergence`` returns the
  ``(array, bool)`` tuple, wrapping the same internal solver.
- Internal helpers ``compute_time_weights_internal`` /
  ``compute_sdid_unit_weights_internal`` destructure the inner calls
  and still return ``Array1<f64>`` (their pyfunctions discard
  convergence — Python callers that need it use the Python two-pass
  dispatcher).

Python (``diff_diff/utils.py``, ``diff_diff/_backend.py``)
- Import the new Rust entry point as
  ``_rust_sc_weight_fw_with_convergence``.
- ``_sc_weight_fw`` / ``_sc_weight_fw_numpy`` gain a
  ``return_convergence=False`` kwarg. Default path is unchanged; with
  the flag set, return ``(weights, converged)``.
- ``compute_sdid_unit_weights`` / ``compute_time_weights`` gain the
  same kwarg and propagate the AND of the two FW passes (pre-sparsify
  + main). When the flag is set, the Rust top-level fast-path is
  skipped (it is silent on non-convergence) in favor of the Python
  two-pass dispatcher; inner FW calls still dispatch to the Rust
  solver via ``sc_weight_fw_with_convergence``.

Consumers (``diff_diff/synthetic_did.py::_bootstrap_se``)
- Replace the ``warnings.catch_warnings`` dance with an explicit
  ``return_convergence=True`` call on both helpers.
  ``fw_nonconvergence_count`` now increments based on the AND of
  ``omega_converged`` and ``lambda_converged`` per draw; works under
  both Rust and pure-Python backends.

Regression test (``tests/test_methodology_sdid.py::TestBootstrapSE::
test_bootstrap_fw_nonconvergence_warning_fires_under_rust``)
- Monkeypatches ``_rust_sc_weight_fw_with_convergence`` to always
  report ``converged=False`` and asserts the aggregate FW non-
  convergence ``UserWarning`` fires on an otherwise-nominal SDID
  bootstrap fit. Skipped when the Rust backend is unavailable.
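
A sketch of the consumer-side aggregation (names and the warning threshold follow the description above; this is not the library code, and the 5% cutoff is an assumption):

```python
import warnings
import numpy as np

def draw_converged(omega_converged: bool, lambda_converged: bool) -> bool:
    """Per-draw convergence is the AND of the omega and lambda solves."""
    return omega_converged and lambda_converged

def warn_on_fw_nonconvergence(per_draw_flags, threshold=0.05):
    """Emit one aggregate UserWarning when the share of non-converged
    bootstrap draws exceeds the threshold; returns the observed share."""
    flags = np.asarray(per_draw_flags, dtype=bool)
    rate = 1.0 - flags.mean()
    if rate > threshold:
        warnings.warn(
            f"Frank-Wolfe did not converge on {rate:.1%} of bootstrap draws",
            UserWarning,
        )
    return rate
```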

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: c0d089bf9d6344c079fbba7f7a00717888d1f463


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed PR surface. On this re-review, the earlier Rust non-convergence warning issue and the BR inference-label concern both look addressed. The remaining issues are P2/P3 only.

Executive Summary

  • The affected method is SyntheticDiD bootstrap variance. The implementation now refits both ω and λ per bootstrap draw, keeps the established sqrt((r-1)/r) × sd(...) SE aggregation, and still routes analytical inference through safe_inference(). I did not find an unmitigated methodology defect.
  • The previous Rust-warning concern appears resolved by the new convergence-return plumbing in diff_diff/utils.py:1350, diff_diff/synthetic_did.py:958, and rust/src/weights.rs:518.
  • The previous BR allow-list concern also appears resolved; unchanged diff_diff/business_report.py:602 already recognizes SDID variance_method values, and the new regression test at tests/test_business_report.py:4689 covers the alpha-override path.
  • One previous P2 remains unresolved: tests/test_methodology_sdid.py:2604 still documents and bounds bootstrap null calibration as if the deleted fixed-weight bootstrap were still the live behavior.
  • Static-only review here; I could not run pytest because the command is unavailable in this environment.

Methodology

Affected method: SyntheticDiD bootstrap variance. Cross-checking docs/methodology/REGISTRY.md:1497, diff_diff/synthetic_did.py:948, diff_diff/synthetic_did.py:1053, and diff_diff/synthetic_did.py:644 against the cited SDID material and the current synthdid bootstrap flow did not reveal an unmitigated methodology defect.

  • Severity P3. Impact: The survey-bootstrap capability regression is explicitly documented and tracked in docs/methodology/REGISTRY.md:1549 and TODO.md:107, so under this rubric it is informational rather than blocking. Concrete fix: none for approval.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: The deleted cross-language parity fixture leaves an external parity anchor as follow-up work, but that follow-up is explicitly tracked in TODO.md:108, so it is not a blocker. Concrete fix: none for approval.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: [Previous finding unresolved] tests/test_methodology_sdid.py:2604 still describes bootstrap as the removed fixed-weight regime and hard-codes rejection_rate > 0.05 at α=0.05. The updated calibration table in docs/methodology/REGISTRY.md:1557 and docs/methodology/REGISTRY.md:1566 now says refit bootstrap is near nominal, so this test can fail on correct behavior or bias future changes toward anti-conservative output. Concrete fix: rewrite the docstring/comments to the refit-bootstrap contract and replace the lower bound with a calibration-agnostic regression guard, e.g. “not collapsed near 0 and not catastrophically high,” or a dispersion/quantile check that specifically catches the old dispatch bug.
  • Severity P3. Impact: [Newly identified] The Methodology Registry still says the Rust backend is silent on Frank-Wolfe non-convergence in docs/methodology/REGISTRY.md:1528, but this PR now threads explicit convergence status through Rust in rust/src/weights.rs:518, diff_diff/utils.py:1350, and emits the aggregate warning in diff_diff/synthetic_did.py:1041. Since REGISTRY.md is the methodology source of truth, that note is now stale. Concrete fix: update the edge-case bullet to reflect the new Rust convergence flag and the aggregated bootstrap warning behavior.

…W note

P2 (Documentation/Tests) — ``test_bootstrap_p_value_null_calibration``
at ``tests/test_methodology_sdid.py:2604`` was still framed around the
deleted fixed-weight regime and encoded an anti-conservative lower
bound ``rejection_rate > 0.05`` that biased the test toward keeping
over-rejection. Rename to ``test_bootstrap_p_value_null_dispersion``
and replace the directional calibration assertion with a
calibration-agnostic dispersion guard: ``np.std(p_values) > 0.10``
catches the pre-fix p-clustering dispatch bug (p≈0.5 on every seed
under the old bug), and a loose band ``0.01 <= rejection_rate <=
0.40`` catches both SE-collapse and SE-explosion. Matches this
session's commit c0d089b by removing the remaining lingering
fixed-weight framing in the test suite. Tested locally: 100-seed
refit bootstrap passes with comfortable margin.
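
The two calibration-agnostic guards can be sketched as (illustrative function, thresholds copied from the description above):

```python
import numpy as np

def null_dispersion_guards(p_values, alpha=0.05):
    """Dispersion guard catches the old p~0.5-clustering dispatch bug;
    the loose rejection band catches SE collapse (rate near 0) and SE
    explosion (rate far above nominal) without asserting a directional
    calibration. Returns the observed rejection rate."""
    p = np.asarray(p_values, dtype=float)
    rejection_rate = float((p < alpha).mean())
    assert np.std(p) > 0.10, "p-values clustered: dispatch-bug regression"
    assert 0.01 <= rejection_rate <= 0.40, "SE collapse or explosion"
    return rejection_rate
```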

P3 (newly identified) — REGISTRY.md §Frank-Wolfe non-convergence edge
case (line 1528) still said the Rust backend silently returns the
final iterate and cited a follow-up. This session's commit c0d089b
already landed the Rust-side convergence flag, Python
``return_convergence=True`` kwargs on the weight helpers, and the
aggregate ``UserWarning`` in ``_bootstrap_se``. Update the bullet to
describe the new behavior: bootstrap path now uses
``sc_weight_fw_with_convergence`` (Rust) / numpy
``return_convergence=True`` and aggregates a single summary warning
above the 5% threshold; standalone callers that do not opt into
convergence tracking retain the legacy silent-Rust / warning-numpy
asymmetry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 9280cddab051623d6a9b5d30c9eb006cecf81397


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed PR surface. The SDID refit-bootstrap implementation itself looks methodologically consistent with the cited SDID paper/official synthdid variance surface, and the previous re-review items around the null-calibration test and the Rust non-convergence note appear addressed. Static review only; I did not run pytest or the coverage Monte Carlo harness here.

Executive Summary

  • The affected method is SyntheticDiD variance estimation, specifically the refit variance_method="bootstrap" path; I did not find a silent correctness bug in the changed estimator/SE/inference flow.
  • The prior test-related concern looks resolved: the slow bootstrap null-calibration guard is now calibration-agnostic instead of baking fixed-weight over-rejection into the expected behavior.
  • The survey-bootstrap capability regression is explicitly documented in REGISTRY.md and tracked in TODO.md, so it is informational rather than blocking under this rubric.
  • Remaining findings are P3-only documentation/UX drift: incorrect R-default wording, one source-paper overstatement in the new coverage narrative, one stale registry test reference, and one outdated survey warning string.

Methodology

Affected method: SDID variance estimation. The changed bootstrap path in diff_diff/synthetic_did.py:L590-L657, diff_diff/synthetic_did.py:L830-L1059, diff_diff/utils.py:L1301-L1726, and rust/src/weights.rs:L125-L558 re-estimates both ω and λ on each bootstrap draw, keeps the established sqrt((r-1)/r) * sd(...) SE aggregation, and still routes bootstrap/jackknife inference through safe_inference(). That is consistent with the SDID source reference and the official synthdid variance implementation, which documents bootstrap/jackknife/placebo as Algorithms 2/3/4 and implements bootstrap by resampling all units, renormalizing omega, and re-entering synthdid_estimate with stored opts. (aeaweb.org)

  • Severity P3. Impact: diff_diff/synthetic_did.py:L50-L61 and METHODOLOGY_REVIEW.md:L517-L519 now make the variance defaults internally inconsistent: the placebo bullet/review note still say placebo is “R’s default,” while the same PR also describes bootstrap as matching “R’s default,” and upstream synthdid docs/source show vcov() defaults to bootstrap. This does not change runtime behavior, but it misstates the reference implementation and obscures that the library’s placebo default is a divergence from R rather than a match. Concrete fix: change those lines to say placebo is the library default, or add a REGISTRY.md Note (deviation from R) if that divergence is intentional. (synth-inference.github.io)
  • Severity P3. Impact: docs/methodology/REGISTRY.md:L1566-L1570 says the PR’s smaller-panel jackknife over-rejection is “in line” with Arkhangelsky et al. §6.3’s “98% / 93% coverage pattern.” The cited paper preview presents mixed evidence instead: the iid design is slightly conservative at 98% coverage, while the dependent-error design is 93% coverage. The current sentence overstates what the cited source supports. Concrete fix: rephrase this as mixed jackknife calibration in the paper, rather than uniformly anti-conservative support. (aeaweb.org)

Code Quality

No findings.

Performance

No findings.

Maintainability

  • Severity P3. Impact: diff_diff/synthetic_did.py:L1128-L1133 and diff_diff/synthetic_did.py:L1220-L1224 still tell users to “Consider using variance_method='bootstrap'” from placebo failure paths. After this PR, bootstrap rejects every survey design, so that guidance is now wrong on the newly restricted survey-weighted path. Concrete fix: branch those warning strings on survey presence and suggest jackknife / more controls for survey users.

Tech Debt

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: docs/methodology/REGISTRY.md:L1570 still points readers to TestPValueSemantics::test_bootstrap_p_value_null_calibration, but the PR renamed the actual regression guard to tests/test_methodology_sdid.py:L2604-L2657. The registry now references a non-existent test when describing the calibration safeguard. Concrete fix: update the registry reference to test_bootstrap_p_value_null_dispersion.

Five P3-only items from R8 CI review:

1. Correctly attribute R's default `vcov()` method:
   - diff_diff/synthetic_did.py:53 docstring previously claimed placebo was
     "R's default". R's `synthdid::vcov()` actually defaults to
     `method="bootstrap"`. Reword to describe placebo as the library default
     with a rationale paragraph (survey availability, perf) and cross-
     reference to the REGISTRY Note below.
   - METHODOLOGY_REVIEW.md item 5 said the same incorrect thing. Rewrite
     to frame the default as a deliberate library deviation with the same
     two-reason rationale.

2. Add a REGISTRY.md Note (default variance_method deviation from R)
   that documents the rationale explicitly (survey availability + perf)
   so the AI reviewer recognizes the deviation as documented rather than
   as an accidental contradiction between the docstring and R's actual
   default.

3. Soften the placebo-failure fallback guidance in
   `_placebo_variance_se`: the previous strings recommended
   `variance_method="bootstrap"`, which now raises `NotImplementedError`
   on every survey design. Branch on `w_control is not None` (survey
   fit) to recommend jackknife + adding controls for survey users and
   keep the bootstrap + jackknife + more-controls recommendation for
   non-survey users.

4. Fix the REGISTRY pointer to the slow dispersion guard: reference
   `TestPValueSemantics::test_bootstrap_p_value_null_dispersion` (the
   current name) instead of the pre-rename
   `test_bootstrap_p_value_null_calibration`, and describe the new
   contract (calibration-agnostic dispersion + loose rejection-rate
   band).

5. Rephrase the REGISTRY coverage-MC narrative about jackknife
   anti-conservatism: the paper's AER §6.3 shows mixed jackknife
   evidence (98% iid — slightly conservative; 93% AR(1) — slightly
   anti-conservative). Our observed anti-conservatism aligns with the
   AR(1) branch of the paper's evidence, not a uniform "in line"
   pattern.
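The survey-aware branching in item 3 can be sketched as follows — a hedged illustration of the `w_control is not None` split, with hypothetical message text rather than the library's actual strings:

```python
import warnings

def placebo_failure_guidance(w_control=None):
    # Branch the fallback recommendation on survey presence: bootstrap now
    # raises NotImplementedError on every survey design, so it must not be
    # suggested when w_control marks a survey-weighted fit.
    if w_control is not None:
        msg = ("Placebo variance estimation failed. Consider "
               "variance_method='jackknife' or adding more control units.")
    else:
        msg = ("Placebo variance estimation failed. Consider "
               "variance_method='bootstrap', variance_method='jackknife', "
               "or adding more control units.")
    warnings.warn(msg, UserWarning)
    return msg
```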

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 710f96654d443e7a6546febe650e0e7cc97e7578


Overall Assessment

✅ Looks good

Executive Summary

  • No unmitigated P0/P1 findings in the changed SDID estimator / SE / inference path.
  • The refit bootstrap implementation now matches the documented methodology surface and the official synthdid variance contract, rather than the old fixed-weight shortcut.
  • The prior re-review P3s appear addressed: default-vs-R wording, the jackknife coverage narrative, the stale registry test reference, and the survey-aware placebo fallback messaging are all updated in the changed files.
  • The remaining items are P3-only: one user-facing runtime-guidance inconsistency, plus tracked tech-debt items around survey-bootstrap composition and restoring an external parity anchor.
  • This was a static review only; I could not execute tests in this workspace because numpy and pytest are not installed.

Methodology

No unmitigated findings. The refit bootstrap path in diff_diff/synthetic_did.py:L834-L1063, together with the warm-start / convergence plumbing in diff_diff/utils.py:L1301-L1715 and rust/src/weights.rs:L121-L555, is consistent with the methodology registry in docs/methodology/REGISTRY.md:L1497-L1552 and with the official synthdid surface: vcov() documents bootstrap/jackknife/placebo as Algorithms 2/3/4, defaults to bootstrap, and the R bootstrap code renormalizes omega then re-enters synthdid_estimate() with stored opts, while synthdid_estimate() treats update.omega / update.lambda as re-estimation from the passed weights as initializations. (synth-inference.github.io)

Code Quality

No findings.

Performance

No code-level findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 (tracked). Impact: diff_diff/synthetic_did.py:L315-L341, docs/methodology/REGISTRY.md:L1549-L1551, and TODO.md:L107-L108 explicitly document that paper-faithful bootstrap currently rejects all survey designs; full-design survey users therefore have no SDID variance path in this release, and pweight-only users must use placebo/jackknife. Concrete fix: none for approval; the weighted-Frank-Wolfe / Rao-Wu composition follow-up is already tracked.
  • Severity P3 (tracked). Impact: the new external parity anchor is deferred in TODO.md:L108-L108, and the registry now explicitly avoids claiming bit-level cross-language bootstrap parity in docs/methodology/REGISTRY.md:L1506-L1508. Concrete fix: none for approval; land the tracked R/Julia parity harness later.

Security

No findings.

Documentation/Tests

Single actionable P3 from R9 CI review: user-facing runtime wording for
refit bootstrap had diverged across surfaces, giving conflicting
expectations about the cost of the new bootstrap path:

- CHANGELOG.md and diff_diff/synthetic_did.py said ~5-30x slower.
- diff_diff/power.py said ~10-100x slower (two sites).
- docs/choosing_estimator.rst said ~10-100x slower.
- docs/performance-scenarios.md said ~10-100x slower.
- docs/methodology/REGISTRY.md coverage-MC block said ~10-100x slower.
- docs/tutorials/03_synthetic_did.ipynb and
  docs/tutorials/18_geo_experiments.ipynb said ~10-100x slower.
- benchmarks/python/coverage_sdid.py said the 500-seed MC run takes
  ~2-4 hours, while REGISTRY.md said ~15-40 min (the actually-observed
  wall-clock; aer63 is ~37 min, balanced + unbalanced ~2 min combined).

Unify on "~5-30x slower than placebo (panel-size dependent)" for the
per-fit slowdown (the warm-start plumbing closed the gap vs the pre-
warm-start cold-start estimate of 10-100x) and on "~15-40 min" for the
coverage MC wall-clock. The CHANGELOG entry already notes the 10-100x
figure as a historical "prior estimate" — left as-is so the release
notes continue to explain the revision.

Also fix two tutorial surfaces that still called placebo "R's default"
(tutorial 03, sections 7 and 10). R's default is bootstrap; placebo is
the library default per the REGISTRY Note added in 710f966. Reword to
describe placebo as the library default with the rationale pointer.

Verified: 353 tests pass across test_methodology_sdid, test_power,
test_guides (UTF-8 fingerprint preserved). Tutorial-18 nbmake drift
guards unaffected because the change is markdown-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: dc2045fb781e7d16c6c3f768d9f6a4535d51d925


Overall Assessment

✅ Looks good

Static review only; I could not execute tests in this workspace because numpy, pandas, and pytest are not installed.

Executive Summary

  • No unmitigated P0/P1 findings in the changed SyntheticDiD estimator / SE / inference path.
  • The prior P3 on inconsistent slowdown/runtime guidance appears addressed across the changed surfaces.
  • The survey-bootstrap capability regression and the loss of the old cross-language parity fixture are both explicitly documented/tracked, so they remain P3-only.
  • [Newly identified] Two unchanged survey docs still advertise the removed SDID Rao-Wu bootstrap path; this is a P3 documentation cleanup.

Methodology

No unmitigated findings. The refit bootstrap path in diff_diff/synthetic_did.py:594, diff_diff/synthetic_did.py:846, and docs/methodology/REGISTRY.md:1497 is consistent with the source-material contract: official synthdid docs map bootstrap / jackknife / placebo to Algorithms 2 / 3 / 4 and default vcov() to bootstrap; bootstrap_sample() passes renormalized omega plus stored opts back into synthdid_estimate(); and synthdid_estimate() treats supplied weights as initializations while update.omega / update.lambda remain enabled. (synth-inference.github.io)

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 (tracked). Impact: the survey-support regression is explicit and documented: diff_diff/synthetic_did.py:315 and diff_diff/synthetic_did.py:334 now reject full designs and any bootstrap-plus-survey combination, while docs/methodology/REGISTRY.md:1549 and TODO.md:107 track the weighted-Frank-Wolfe / Rao-Wu follow-up. Concrete fix: none for approval; land the tracked weighted-FW survey-bootstrap work later.
  • Severity P3 (tracked). Impact: the old fixed-weight R fixture was removed, so the refit bootstrap no longer has an external parity anchor in-tree; the follow-up is explicitly tracked in TODO.md:108. Concrete fix: none for approval; add the planned Julia/R parity harness later.

Security

No findings.

Documentation/Tests

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 22, 2026
The Pure Python Fallback CI job failed this test at rel-diff 0.5310 >
0.40 tolerance. Root cause is test-infrastructure, not a correctness
regression.

ci_params.bootstrap(min_n=...) silently caps min_n at 49 in pure-Python
mode to keep CI fast (see tests/conftest.py:210); the test's 0.40
tolerance was explicitly calibrated for B∈[100, 200] per its docstring
comment. At B=49 the bootstrap SE is not yet converged to the placebo
SE (rel-diff 0.5310 at B=49; 0.3856 at B=100; 0.2708 at B=200 on the
same seed), so the failure is MC-noise, not a regression.

The 15 Rust-backed matrix jobs (macOS/Linux x86/Linux ARM/Windows × 3
Python versions) all run the test at the full B=200 and pass with
comfortable margin — the regression guard is still exercised on the
default user install path.

Skip under pure-Python mode with an explicit rationale citing the
min_n cap and the Rust-backed coverage that preserves the contract.
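One way the skip could look, assuming a module-level backend flag like the HAS_RUST_BACKEND name used elsewhere in this PR (the test name below is a placeholder for the guarded test):

```python
import pytest

try:
    from diff_diff._backend import HAS_RUST_BACKEND  # flag name assumed from this PR
except ImportError:
    HAS_RUST_BACKEND = False  # keeps the sketch importable outside the library

@pytest.mark.skipif(
    not HAS_RUST_BACKEND,
    reason="pure-Python CI caps bootstrap draws at min_n=49 (tests/conftest.py); "
           "the 0.40 tolerance was calibrated for B in [100, 200], and the 15 "
           "Rust-backed matrix jobs still run this guard at the full B=200",
)
def test_bootstrap_matches_placebo_se():
    ...  # original regression body unchanged
```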

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 23, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b123c2b763ec0d92584c8e5f1130a92140e3d343


Overall Assessment

✅ Looks good

Static diff review only; numpy, pandas, and pytest are not installed in this workspace, so I could not execute the updated tests.

Executive Summary

Methodology

No unmitigated findings. The changed refit bootstrap path at diff_diff/synthetic_did.py:846 is consistent with the updated methodology registry at docs/methodology/REGISTRY.md:1497, and bootstrap/jackknife inference still routes through safe_inference() at diff_diff/synthetic_did.py:648, so I did not find a new P0/P1 NaN/inference regression.

Code Quality

  • Severity P2. Impact: In _bootstrap_se, fw_nonconvergence_count is incremented before the draw is accepted into bootstrap_estimates, so a draw that later fails the np.isfinite(tau) gate can still count toward the warning numerator while the denominator is n_successful valid draws. That does not change the reported SE, but it can over-trigger or overstate the documented “share of valid bootstrap draws” warning contract. Concrete fix: increment the counter only after the draw passes the finite-tau gate, or keep a separate fw_nonconvergence_valid_count for the warning path. References: diff_diff/synthetic_did.py:982, diff_diff/synthetic_did.py:993, diff_diff/synthetic_did.py:1045, docs/methodology/REGISTRY.md:1508.

Performance

No findings. The refit-bootstrap slowdown is explicit and documented, not an accidental regression. See CHANGELOG.md:15.

Maintainability

No findings.

Tech Debt

  • Severity P3 (tracked). Impact: SDID survey-bootstrap support is intentionally removed and front-door rejected in this release; that is a capability regression, but it is documented and tracked rather than silently wrong. Concrete fix: none for approval; implement the weighted-Frank-Wolfe + Rao-Wu composition follow-up tracked in TODO.md:107 and described in docs/methodology/REGISTRY.md:1551.
  • Severity P3 (tracked). Impact: deleting the old R bootstrap fixture removes the only external cross-language parity anchor for SDID bootstrap math; protection now relies on internal baselines and characterization tests. Concrete fix: none for approval; add the planned Julia/R parity harness tracked in TODO.md:108.

Security

No findings.

Documentation/Tests

@igerber igerber removed the ready-for-ci Triggers CI test workflows label Apr 23, 2026
R10 CI review found two items on top of the previous ✅ Looks good.

P2 Code Quality — aggregate Frank-Wolfe non-convergence warning
numerator/denominator mismatch. In ``_bootstrap_se``,
``fw_nonconvergence_count`` was incremented before the draw cleared
the ``np.isfinite(tau)`` gate. A draw that failed FW convergence AND
then produced non-finite τ would count toward the warning numerator
while the denominator is ``n_successful`` (draws that cleared the
finite-τ gate). That does not affect the reported SE, but it can
overstate the documented "share of valid bootstrap draws" warning
contract and cause the warning to over-trigger.

Fix: move the increment inside the ``if np.isfinite(tau)`` block so
the numerator only counts draws that also contribute to the SE. A
draw failing the finite-τ gate is retried upstream and should not
inflate the non-convergence rate.
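The gate-order fix can be sketched as follows — names are illustrative, but the invariant is the one described above: only draws that clear the finite-τ gate may touch either the warning numerator or the n_successful denominator:

```python
import numpy as np

def record_draw(tau, fw_converged, estimates, counters):
    # Apply the finite-tau gate first; a non-finite draw is retried upstream
    # and contributes to neither counter, so numerator and denominator
    # describe the same population of valid draws.
    if not np.isfinite(tau):
        return False
    estimates.append(tau)
    counters["n_successful"] += 1
    if not fw_converged:
        counters["fw_nonconvergence"] += 1  # increment moved inside the gate
    return True

estimates, counters = [], {"n_successful": 0, "fw_nonconvergence": 0}
for tau, converged in [(1.2, True), (float("nan"), False), (0.9, False)]:
    record_draw(tau, converged, estimates, counters)
print(counters)  # {'n_successful': 2, 'fw_nonconvergence': 1}
```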

P3 Documentation (previously unresolved) — two survey-cross-reference
docs still advertised SyntheticDiD Rao-Wu bootstrap support, which
the estimator now rejects at fit-time with NotImplementedError:

- ``docs/methodology/survey-theory.md:725`` — rewrite the Rao-Wu bullet
  to exclude SDID explicitly, with a pointer to the REGISTRY sketch
  for the deferred weighted-FW composition and to pweight-only
  placebo/jackknife as the available SDID variance alternatives.
- ``docs/tutorials/16_survey_did.ipynb`` cell-35-f1ef376c — update the
  support-matrix table so SDID's row reads
  "pweight only (placebo / jackknife)" with bootstrap struck out, and
  add a "Note on SyntheticDiD" below explaining which methods accept
  pweight-only and why bootstrap rejects all survey designs (weighted-
  FW derivation tracked in TODO.md).

Test coverage unchanged: TestBootstrapSE ran all 7 tests under Rust,
with the 48-of-50 non-convergence warning still firing on the
regression test, confirming the warning numerator still tallies
correctly after the gate-order change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 23, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 5eadcb6dbd1c11aca9a70aeb5a7d54c0ff09296b


Overall Assessment

✅ Looks good

Static diff review only; numpy, pandas, and pytest are not installed in this workspace, so I could not execute the updated tests.

Executive Summary

  • No unmitigated P0/P1 findings in the changed SDID estimator / inference path.
  • The refit bootstrap implementation in diff_diff/synthetic_did.py:L594-L608 and diff_diff/synthetic_did.py:L846-L1065 now re-estimates ω and λ on each bootstrap draw, which aligns with Algorithm 2’s “compute the SDID estimator based on (Y^(b), W^(b))” requirement and with R’s bootstrap_sample() / synthdid_estimate() warm-start contract. (nber.org)
  • The previous P2 on Frank-Wolfe non-convergence accounting appears addressed: fw_nonconvergence_count now increments only after a finite draw is accepted (diff_diff/synthetic_did.py:L986-L998), so numerator and denominator both reference valid draws.
  • The prior survey-doc mismatch also appears resolved in the changed docs (docs/methodology/survey-theory.md:L725-L735, docs/tutorials/16_survey_did.ipynb:L1090, docs/choosing_estimator.rst:L781-L821).
  • The removed external bootstrap parity fixture is not a blocker because the follow-up is explicitly tracked in TODO.md:L108; regression coverage on the changed surfaces is otherwise solid (tests/test_methodology_sdid.py:L500-L710, tests/test_survey_phase5.py:L179-L360, tests/test_business_report.py:L4658-L4702).

Methodology

  • No unmitigated P0/P1 findings.
  • Severity P3 (informational). Impact: variance_method="bootstrap" is now a materially different user-facing method than before; existing fits will get different SE / p-value / CI outputs. This is explicitly documented in CHANGELOG.md:L15-L21 and docs/methodology/REGISTRY.md:L1497-L1552, so it is not a defect. Concrete fix: none for approval. The implementation at diff_diff/synthetic_did.py:L594-L608 and diff_diff/synthetic_did.py:L846-L1065 matches the paper / R contract for recomputing SDID on each bootstrap draw. (nber.org)

Code Quality

  • No findings.

Performance

  • No findings. The slowdown is explicitly documented as an intentional consequence of replacing the removed fixed-weight shortcut with the refit path (CHANGELOG.md:L15-L21, docs/methodology/REGISTRY.md:L1508-L1509).

Maintainability

  • No findings. Extracting _validate_config() and making set_params() transactional improves constructor / setter contract parity (diff_diff/synthetic_did.py:L188-L203, diff_diff/synthetic_did.py:L1479-L1511).
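The shared-validation plus transactional-setter pattern noted in that bullet can be sketched as follows — not the library's actual class, with the enum values and defaults assumed for illustration:

```python
class SDIDConfigSketch:
    # Constructor and sklearn-style setter share one validation gate; a
    # failing set_params rolls back every touched attribute.
    _ALLOWED = {"placebo", "bootstrap", "jackknife"}

    def __init__(self, variance_method="placebo", n_bootstrap=200):
        self.variance_method = variance_method
        self.n_bootstrap = n_bootstrap
        self._validate_config()  # same gate the setter re-runs

    def _validate_config(self):
        if self.variance_method not in self._ALLOWED:
            raise ValueError(f"unknown variance_method {self.variance_method!r}")
        if self.n_bootstrap < 1:
            raise ValueError("n_bootstrap must be >= 1")

    def set_params(self, **params):
        unknown = [k for k in params if not hasattr(self, k)]
        if unknown:
            raise ValueError(f"invalid parameters: {unknown}")
        snapshot = {k: getattr(self, k) for k in params}
        try:
            for k, v in params.items():
                setattr(self, k, v)
            self._validate_config()
        except Exception:
            for k, v in snapshot.items():
                setattr(self, k, v)  # roll back: never leave a half-mutated instance
            raise
        return self
```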

Tech Debt

  • Severity P3 (tracked in TODO.md). Impact: deleting benchmarks/R/generate_sdid_bootstrap_parity_fixture.R and tests/data/sdid_bootstrap_indices_r.json removes the only external cross-language parity anchor for SDID bootstrap in this PR. Concrete fix: none for approval; the replacement parity harness is already tracked in TODO.md:L108.
  • Severity P3 (tracked in TODO.md). Impact: SDID bootstrap no longer supports any survey design, and full-design SDID has no variance path this release. That is a real capability regression, but it is front-door rejected, documented in docs/methodology/REGISTRY.md:L1549-L1551, and tracked in TODO.md:L107, so it is not a blocker. Concrete fix: none for approval; implement the weighted-Frank-Wolfe + Rao-Wu follow-up.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new return_convergence=True helper contract is only partially documented; compute_time_weights() and compute_sdid_unit_weights() explain the new parameter but their Returns sections still advertise only np.ndarray, not the tuple return path. Concrete fix: update diff_diff/utils.py:L1462-L1510 and diff_diff/utils.py:L1598-L1643 to mirror _sc_weight_fw’s union-return documentation.
  • Otherwise no findings. The changed tests cover the main regression surfaces: bootstrap refit behavior / non-convergence (tests/test_methodology_sdid.py:L500-L710), survey rejection and pweight-only fallbacks (tests/test_survey_phase5.py:L179-L360), BR inference labeling (tests/test_business_report.py:L4658-L4702), and the coverage artifact schema (tests/test_methodology_sdid.py:L2976-L3005).

R11 CI review flagged that compute_time_weights and
compute_sdid_unit_weights describe the new return_convergence parameter
but their Returns sections still advertise only np.ndarray, not the
tuple return.

Update both docstrings to mirror _sc_weight_fw's union-return shape:
Returns now read "np.ndarray or Tuple[np.ndarray, bool]" with an
explicit description of the two-pass AND convergence flag (True iff
both pre-sparsify and main FW passes hit the min-decrease break; False
if either hit max_iter). Also promote the return_convergence kwarg
description from the implementation-detail placement to the main
Parameters block with the same contract note as _sc_weight_fw so
standalone callers see the legacy-ABI preservation guarantee and the
Rust top-level fast-path skip.

Docs-only change; no code or test changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 23, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ec34beb8c5f98db72001cfb534ac4d5446e9be60


Overall Assessment

✅ Looks good

Static diff review only; numpy, pandas, and pytest are not installed in this workspace, so I could not execute the updated tests or the Monte Carlo script.

Executive Summary

  • No unmitigated P0/P1 findings in the changed SyntheticDiD estimator / inference path.
  • The refit bootstrap now re-estimates both ω and λ on each pairs-bootstrap draw, and bootstrap/jackknife inference still routes through safe_inference(), which matches the current Methodology Registry contract for Arkhangelsky et al. Algorithm 2 / R-default vcov(method="bootstrap"). diff_diff/synthetic_did.py:L594-L661, diff_diff/synthetic_did.py:L846-L1067, docs/methodology/REGISTRY.md:L1497-L1552
  • Previous re-review concerns appear addressed: Frank-Wolfe non-convergence is now surfaced through explicit Rust/Python convergence flags instead of warning capture, and the tuple-return helper contracts are now documented. diff_diff/synthetic_did.py:L952-L998, diff_diff/utils.py:L1301-L1382, diff_diff/utils.py:L1462-L1667, rust/src/weights.rs:L518-L555
  • Survey support is now internally consistent across code, docs, and tests: full designs are front-door rejected on all SDID variance methods, and pweight-only surveys are limited to placebo/jackknife. This regression is documented and tracked, so it is informational only. diff_diff/synthetic_did.py:L315-L342, docs/choosing_estimator.rst:L785-L823, tests/test_survey_phase5.py:L179-L359, TODO.md:L107-L108
  • Remaining issues are P3 only: one stale fallback message on the power-analysis surface, and weak provenance metadata on the committed coverage artifact.

Methodology

No findings. The changed estimator math, SE formula, retry-to-B contract, survey guards, and bootstrap/jackknife p-value dispatch match the current Registry description. diff_diff/synthetic_did.py:L594-L661, diff_diff/synthetic_did.py:L846-L1067, docs/methodology/REGISTRY.md:L1497-L1552

Code Quality

No findings.

Performance

No findings. The slowdown from fixed-weight bootstrap to per-draw refit is explicit and documented rather than silent. CHANGELOG.md:L15-L21, docs/performance-scenarios.md:L237-L249

Maintainability

No findings. Extracting _validate_config() and making set_params() transactional improves constructor/setter parity. diff_diff/synthetic_did.py:L186-L203, diff_diff/synthetic_did.py:L1479-L1511

Tech Debt

  • Severity P3 (tracked in TODO.md). Impact: SDID no longer has any survey-aware variance path for strata/PSU/FPC, and even pweight-only survey fits cannot use bootstrap in this release. This is front-door rejected and documented, so it is not a blocker. Concrete fix: none for approval; implement the tracked weighted-Frank-Wolfe + Rao-Wu composition follow-up. diff_diff/synthetic_did.py:L315-L342, docs/methodology/REGISTRY.md:L1549-L1551, TODO.md:L107-L107
  • Severity P3 (tracked in TODO.md). Impact: deleting the old R fixture leaves no external cross-language parity anchor for the new refit bootstrap; validation is currently same-library only. Concrete fix: none for approval; add the tracked R/Julia parity harness. TODO.md:L108-L108

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the new placebo-feasibility error text on the power-analysis surface still recommends variance_method='bootstrap' even on survey-weighted custom-DGP paths, but SyntheticDiD.fit() now rejects bootstrap for every survey design. Concrete fix: branch the fallback guidance in _check_sdid_placebo_data() and the registry-path check so survey-weighted SDID only suggests jackknife (or adding controls). diff_diff/power.py:L714-L721, diff_diff/power.py:L2043-L2051, diff_diff/synthetic_did.py:L334-L342
  • Severity P3. Impact: the committed coverage artifact records "library_version": "3.2.0" even though it documents unreleased refit-bootstrap behavior, which weakens provenance for future methodology audits/regenerations. Concrete fix: emit a git SHA or explicit dev/unreleased marker in the artifact metadata, then regenerate benchmarks/data/sdid_coverage.json. benchmarks/python/coverage_sdid.py:L343-L374, benchmarks/data/sdid_coverage.json:L2-L8
  • Otherwise no findings. The changed tests cover the main regression surfaces: refit-bootstrap behavior/non-convergence, survey rejection, setter rollback, result/report labeling, and coverage-artifact schema. tests/test_methodology_sdid.py:L500-L715, tests/test_methodology_sdid.py:L1234-L1311, tests/test_methodology_sdid.py:L2968-L3010, tests/test_business_report.py:L4658-L4700, tests/test_survey_phase5.py:L179-L359

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 23, 2026
@igerber igerber merged commit 172d1d8 into main Apr 23, 2026
23 of 24 checks passed
@igerber igerber deleted the sdid-bootstrap-refit branch April 23, 2026 23:14
igerber added a commit that referenced this pull request Apr 24, 2026
Foundation for restoring SDID survey-bootstrap support (PR #352, follow-up
to #351 which front-door rejected all survey designs). This commit adds the
weighted-FW kernel + Python wrappers; the bootstrap integration lands in
the next commit.

Rust (rust/src/weights.rs, rust/src/lib.rs):
- New `sc_weight_fw_gram_weighted` and `sc_weight_fw_standard_weighted`
  loop variants. Identical to the unweighted loops except for the
  regularization term: `half_grad[j]` picks up `eta*reg_w[j]*lam[j]` in
  place of `eta*lam[j]`, and the FW step-size denominator uses the
  diag(reg_w)-weighted simplex direction norm
  `Σ_j reg_w[j]*d[j]²` (which simplifies to
  `Σ_j reg_w[j]*lam[j]² + reg_w[i] - 2*reg_w[i]*lam[i]` for d = e_i - lam).
- New `sc_weight_fw_weighted_internal` dispatcher that delegates to the
  unweighted internal when reg_weights is None (preserves the legacy
  numeric contract for any future caller that wants the generic shape).
- Two new pyfunctions: `sc_weight_fw_weighted` and
  `sc_weight_fw_weighted_with_convergence`. Same call shape as the
  existing unweighted siblings plus a trailing `reg_weights` kwarg.
  Registered in lib.rs.
- 3 new Rust unit tests in rust/src/weights.rs:
    * test_weighted_fw_reg_weights_none_delegates — bit-identity at
      rel=1e-14 against the unweighted internal.
    * test_weighted_fw_uniform_reg_weights_matches_unweighted — uniform
      rw=1 collapses to uniform regularization (rel=1e-12, allowing for
      ULP-scale drift from different float reduction orders).
    * test_weighted_fw_simplex_invariants — for arbitrary positive rw
      and both gram (T0<N) and standard (T0>=N) paths, returned ω sums to
      1 and is non-negative.
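The closed form quoted above for the diag(reg_w)-weighted simplex-direction norm can be checked numerically: for d = e_i - lam, expanding Σ_j rw_j·d_j² gives Σ_j rw_j·lam_j² + rw_i - 2·rw_i·lam_i, since e_i contributes rw_i once and the cross term is -2·rw_i·lam_i. A standalone verification sketch:

```python
import numpy as np

# Verify: sum_j rw[j]*d[j]^2 == sum_j rw[j]*lam[j]^2 + rw[i] - 2*rw[i]*lam[i]
# for every simplex vertex direction d = e_i - lam.
rng = np.random.default_rng(42)
n = 7
rw = rng.uniform(0.1, 3.0, size=n)       # arbitrary positive reg weights
lam = rng.dirichlet(np.ones(n))          # a point on the probability simplex
for i in range(n):
    d = -lam.copy()
    d[i] += 1.0                          # d = e_i - lam
    lhs = np.sum(rw * d**2)
    rhs = np.sum(rw * lam**2) + rw[i] - 2.0 * rw[i] * lam[i]
    assert np.isclose(lhs, rhs, rtol=1e-12)
print("weighted step-size identity verified at all simplex vertices")
```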

Python (diff_diff/utils.py, diff_diff/_backend.py):
- Export _rust_sc_weight_fw_weighted and _with_convergence from _backend
  (mirrors the shape added for _rust_sc_weight_fw_with_convergence in
  PR #351 c0d089b).
- Extend `_sc_weight_fw` and `_sc_weight_fw_numpy` with a
  `reg_weights: Optional[np.ndarray] = None` kwarg. When set on the Rust
  path, dispatches to the new weighted pyfunctions; on the pure-Python
  path, runs a weighted FW loop mirroring the Rust derivation.
- New helper `compute_sdid_unit_weights_survey(Y_pre_control,
  Y_pre_treated_mean, rw_control, ...)`: column-scales Y_pre_control by
  rw_control and passes rw_control as reg_weights so the FW solves the
  unit-weight survey-bootstrap objective
    min_{ω simplex} Σ_t (Σ_i rw_i·ω_i·Y_i,pre[t] - treated_pre[t])² +
                    ζ²·Σ_i rw_i·ω_i²
  Two-pass sparsify-refit structure mirrors compute_sdid_unit_weights.
  Returns ω on the standard simplex (caller composes ω_eff downstream).
- New helper `compute_time_weights_survey(Y_pre_control, Y_post_control,
  rw_control, ...)`: row-scales Y_time by sqrt(rw_control) and passes
  no reg_weights (uniform reg on λ — λ is per-period, rw is per-control,
  no alignment for per-λ weighting). Two-pass structure unchanged.
- Both new helpers expose `return_convergence=True` returning the AND of
  the two pass convergence flags, mirroring the contract added in
  PR #351 c0d089b.
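The survey-bootstrap unit-weight objective quoted above can be evaluated directly in numpy. A minimal sketch — the function name and shapes are illustrative, not the library's kernel; `Y_pre_control` is taken here as `(T0, N)` with one column per control unit, matching the column-scaling described above:

```python
import numpy as np

def survey_unit_objective(omega, Y_pre_control, treated_pre, rw, zeta):
    """Evaluate the survey unit-weight objective for a candidate omega
    on the simplex (illustrative sketch, not the library's FW kernel):

        loss(omega) = sum_t (sum_i rw_i * omega_i * Y[t, i]
                             - treated_pre[t])**2
                      + zeta**2 * sum_i rw_i * omega_i**2
    """
    fitted = Y_pre_control @ (rw * omega)          # (T0,) weighted fit
    resid = fitted - treated_pre                   # (T0,) residuals
    reg = zeta**2 * np.sum(rw * omega**2)          # rw-weighted ridge term
    return float(resid @ resid + reg)
```

With `rw` all ones this collapses to the standard regularized SC objective, which is the uniform-rw equivalence the tests above check.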

Tests (tests/test_weighted_fw.py — new, 15 tests):
- _sc_weight_fw weighted-reg path: reg_weights=None matches unweighted at
  bit-identity; uniform reg matches unweighted at rel=1e-12;
  Rust/numpy parity at rel=1e-9; simplex invariants under arbitrary rw;
  return_convergence tuple shape.
- compute_sdid_unit_weights_survey: uniform-rw equivalence to unweighted
  helper, simplex invariants under arbitrary rw, shape-mismatch raises,
  return_convergence AND.
- compute_time_weights_survey: same coverage matrix, plus a zero-rw
  subset test (Rao-Wu-style undrawn PSU yields valid simplex λ).
- Backend parity: pure-Python vs Rust weighted-helper output at rel=1e-7
  for both unit and time helpers (monkeypatches HAS_RUST_BACKEND).

ABI preservation: existing unweighted callers of _sc_weight_fw,
compute_sdid_unit_weights, compute_time_weights are unaffected — the new
kwarg defaults to None and dispatches to the legacy code path. The
bit-identity check on
TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap] still
passes at rel=1e-14 (verified in the next commit, when the bootstrap
integration lands).
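The ABI-preservation contract above — a trailing kwarg defaulting to None that routes to the legacy code path untouched — reduces to a small dispatch pattern. A sketch with stand-in kernels (all names here are illustrative, not the library's internals):

```python
import numpy as np

def _fw_unweighted(Y, treated):
    # Stand-in for the legacy unweighted FW kernel.
    n = Y.shape[1]
    return np.full(n, 1.0 / n)

def _fw_weighted(Y, treated, reg_weights):
    # Stand-in for the new weighted FW kernel.
    return reg_weights / reg_weights.sum()

def sc_weight_fw(Y, treated, reg_weights=None):
    """Dispatcher sketch: reg_weights=None preserves the legacy numeric
    contract by delegating to the unweighted kernel unchanged; only an
    explicit array takes the new weighted path."""
    if reg_weights is None:
        return _fw_unweighted(Y, treated)
    return _fw_weighted(Y, treated, np.asarray(reg_weights, dtype=float))
```

Existing callers never pass the kwarg, so they cannot observe the new path — which is why the parity test above can stay at rel=1e-14.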

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…sition

PR #352 restores the SDID survey-bootstrap capability that PR #351
rejected up front as a known regression. Pweight-only and full-design
surveys now both succeed; placebo / jackknife continue to reject full
designs (a separate methodology gap tracked in TODO.md).

`diff_diff/synthetic_did.py::fit` (guards):
- Replace the unconditional strata/PSU/FPC NotImpl guard with a method-
  gated version that fires only for placebo / jackknife. Rationale +
  truth-table in REGISTRY.md §SyntheticDiD survey-support matrix:

      method     pweight-only      strata/PSU/FPC
      bootstrap  ✓ (this PR)       ✓ Rao-Wu (this PR)
      placebo    ✓ unchanged       ✗ NotImpl (separate gap)
      jackknife  ✓ unchanged       ✗ NotImpl (separate gap)

- Delete the unconditional `bootstrap + any-survey` guard added in #351.
  Keep the `weight_type != "pweight"` validation (fweight/aweight still
  rejected).

`diff_diff/synthetic_did.py::fit` (survey resolution):
- After validating the per-unit survey weights (`w_treated`, `w_control`),
  also collapse the observation-level `resolved_survey` to a unit-level
  view via `collapse_survey_to_unit_level(...)` ordered as
  `[*control_units, *treated_units]`. The resulting
  `resolved_survey_unit` is what `_bootstrap_se` slices via
  `boot_rw[:n_control]` / `boot_rw[n_control:]` per Rao-Wu draw.

`diff_diff/synthetic_did.py::fit` (dispatcher):
- Branch the bootstrap call on whether the design is pweight-only or
  full design (strata/PSU/FPC). Pass `w_control`/`w_treated` for
  pweight-only, `resolved_survey=resolved_survey_unit` for full design,
  None/None for non-survey.

`diff_diff/synthetic_did.py::_bootstrap_se`:
- New kwargs: `w_control`, `w_treated`, `resolved_survey` (all keyword-
  only, default None — preserves the legacy signature).
- Single-PSU short-circuit: unstratified survey with <2 PSUs returns
  (NaN, []) since the bootstrap distribution is unidentified
  (resampling one PSU yields the same subset every draw). Recovered from
  the pre-PR-#351 fixed-weight Rao-Wu branch (commit 91082e5).
- Per-draw Rao-Wu rescaling for full designs:
  ``rw = generate_rao_wu_weights(resolved_survey, rng)`` sliced over the
  resampled units. Pweight-only path uses ``rw = w_control[boot_idx]``
  (constant per draw, no rescaling).
- Survey-weighted treated-unit means: ``np.average(..., weights=rw_treated_draw)``
  when survey weights are present.
- Warm-start: the simplex init scales by rw before sum_normalize when on
  the survey path, matching the per-draw weighted-FW geometry.
- Per-draw FW dispatch: survey paths call the new
  ``compute_sdid_unit_weights_survey`` / ``compute_time_weights_survey``
  helpers (PR #352 commit 1) which run the weighted-FW kernel; non-
  survey paths continue to call the unweighted helpers (bit-identity
  preserved on the non-survey refit path).
- Post-FW composition: ``ω_eff = rw·ω / Σ(rw·ω)`` for the SDID
  estimator (which expects simplex weights). Degenerate-retry if
  ``Σ(rw·ω) <= 0`` (all mass on rw=0 controls).
- Aggregate FW non-convergence warning: tally is the AND of the two
  helpers' convergence flags per draw, fires above 5% (PR #351 c0d089b
  shape preserved, no copy change).
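The post-FW composition and degenerate-retry check described above can be sketched as follows (hypothetical helper name; the real draw loop redraws on a degenerate result, so the sketch just signals the caller with None):

```python
import numpy as np

def compose_omega_eff(omega, rw):
    """Compose effective simplex weights from the FW output omega and the
    per-draw weights rw:  omega_eff = rw * omega / sum(rw * omega).

    Returns None when all omega mass falls on rw == 0 controls
    (sum(rw * omega) <= 0), so the caller can redraw — the
    degenerate-retry case described above."""
    raw = rw * omega
    total = raw.sum()
    if total <= 0:
        return None  # degenerate draw: every selected control has rw = 0
    return raw / total
```

Because `omega_eff` is renormalized onto the simplex, the downstream SDID estimator (which expects simplex weights) needs no survey-specific changes.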

Tests:
- ``tests/test_survey_phase5.py``: rewrite three PR #351 raises-tests as
  succeeds-tests with explicit SE assertions —
    * ``test_full_design_bootstrap_succeeds`` (was ``_raises``):
      finite SE, populated survey_metadata.n_strata/n_psu, summary()
      includes Survey Design + Bootstrap replications blocks.
    * ``test_bootstrap_with_pweight_only_succeeds`` (was ``_raises``):
      finite SE, variance_method preserved (cross-surface guard).
    * New ``test_bootstrap_full_design_se_differs_from_pweight_only``
      resurrects the PR #351 R3-deleted differs-from contract: ATT
      matches between paths (both compose ω_eff post-fit) but SE
      differs (Rao-Wu adds PSU clustering variance).
- ``tests/test_methodology_sdid.py::TestBootstrapSE``: rewrite two PR #351
  raises-tests as succeeds-tests, plus add the
  ``test_bootstrap_single_psu_returns_nan`` short-circuit regression.

Verified: 308 tests pass across test_methodology_sdid /
test_business_report SDID subset / test_rust_backend / test_survey_phase5
/ test_weighted_fw / test_guides.

Bit-identity check: the non-survey refit path goes through the
unweighted helpers (no weighted-FW dispatch), so
``TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]``
remains at rel=1e-14 — verified passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
Capstone of PR #352. Validates the new weighted-FW + Rao-Wu bootstrap
composition and propagates the landed capability across the
documentation surfaces.

Coverage MC harness (benchmarks/python/coverage_sdid.py):
- Add ``stratified_survey`` as a 4th DGP in ``ALL_DGPS``. Uses
  ``generate_survey_did_data`` to produce an N=40 (strata=2, PSU=2/
  stratum) null-treatment panel with moderate weight variation and
  modest ICC (``psu_re_sd=1.5``). Cohort 7 → post = 7..11 (5 post
  periods). Converts per-observation ``treated`` to a unit-level
  ever-treated indicator (SDID's block-treatment requirement).
- Extend ``DGPSpec`` with an optional ``survey_design_factory``
  callable that returns ``(SurveyDesign, supported_methods_tuple)``.
  For ``stratified_survey``: bootstrap only — placebo / jackknife
  reject strata/PSU/FPC at fit-time, so the harness skips them
  rather than catching the NotImplementedError inside ``_fit_one``.
- ``_fit_one`` gains an optional ``survey_design`` kwarg routed
  through ``SyntheticDiD.fit(survey_design=)``. ``_run_dgp`` calls
  the factory once per seed (DataFrame contents don't affect
  columns) and gates methods on the supported set.
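The factory-plus-gating contract above can be sketched as a small dataclass and filter. Field names beyond `survey_design_factory`, and the method names, are assumptions for illustration — only the `(SurveyDesign, supported_methods_tuple)` return shape comes from the description above:

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class DGPSpec:
    # Illustrative subset of the harness spec.
    name: str
    survey_design_factory: Optional[
        Callable[[], Tuple[object, Tuple[str, ...]]]
    ] = None

def methods_for(spec, all_methods=("bootstrap", "placebo", "jackknife")):
    """Gate estimation methods on the DGP's survey support: a full design
    (strata/PSU/FPC) is bootstrap-only, so the harness skips unsupported
    methods up front rather than catching NotImplementedError in the fit.
    Returns (supported methods, survey design or None)."""
    if spec.survey_design_factory is None:
        return all_methods, None
    design, supported = spec.survey_design_factory()
    return tuple(m for m in all_methods if m in supported), design
```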

Regenerated ``benchmarks/data/sdid_coverage.json`` via
``python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap
200``. Total wall-clock 2421 s (~40 min on M-series Mac, Rust backend);
aer63 remains the long tail at 2237 s, stratified_survey adds only
33 s.

Calibration gate (plan §2.7): ``stratified_survey × bootstrap`` at
α=0.05 returns 0.042 (500 seeds × B=200), inside the calibration
band [0.02, 0.10]. ``mean SE / true SD = 1.25`` indicates the
bootstrap is conservative (it overestimates the empirical sampling
SD by ~25%) — the safer direction under Rao-Wu rescaling with only
4 PSUs total. This validates the weighted-FW + Rao-Wu composition
end-to-end.
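The two diagnostics quoted above reduce to a few lines over the per-seed `(ATT, SE)` results. An illustrative sketch (not the harness's own code) under H0, where the true ATT is 0 and a normal test is assumed:

```python
import numpy as np

def calibration_diagnostics(atts, ses, z_crit=1.959964):
    """Coverage-MC diagnostics under H0 (true ATT = 0):
    - rejection rate: share of seeds with |ATT / SE| > z_crit
      (z_crit ~ 1.96 corresponds to alpha = 0.05 for a normal test)
    - mean SE / true SD: mean estimated SE over the empirical SD of the
      ATT draws; > 1 means the SE is conservative, < 1 anti-conservative.
    """
    atts = np.asarray(atts, dtype=float)
    ses = np.asarray(ses, dtype=float)
    rejection = float(np.mean(np.abs(atts / ses) > z_crit))
    se_ratio = float(np.mean(ses) / np.std(atts, ddof=1))
    return rejection, se_ratio
```

On this reading, the 0.042 rejection and the 1.25 ratio above are consistent: an SE that runs high mechanically pulls the rejection rate below nominal.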

REGISTRY.md §SyntheticDiD:
- Add ``stratified_survey`` row to the coverage MC table and a
  paragraph under it documenting the calibration verdict, the
  conservatism direction, and why placebo/jackknife rows are NaN.
- Replace the survey-support bullet with a truth-table matrix (PR
  #352 shape); add a ``Note (survey + bootstrap composition)``
  documenting the weighted-FW objective (unit and time forms), the
  ω_eff composition, the argmin-set caveat, the per-draw rw
  dispatch (pweight-only vs Rao-Wu), and the single-PSU
  short-circuit.
- Update the ``Note (default variance_method deviation from R)`` to
  drop the "bootstrap rejects surveys" framing (no longer accurate).
- Update the ``Note (coverage Monte Carlo calibration)`` header to
  say "4 representative null-panel DGPs" and flag stratified_survey
  as bootstrap-only.

User-facing docs:
- ``docs/methodology/survey-theory.md``: restore SDID in the Rao-Wu
  Rescaled Bootstrap list; describe the weighted-FW composition.
- ``docs/survey-roadmap.md``: Phase 5 SDID row updated to reflect
  full-design bootstrap support via PR #352; Phase 6 Rao-Wu bullet
  restores SDID.
- ``docs/tutorials/16_survey_did.ipynb`` cell-35: support matrix
  table row for SyntheticDiD switches from "pweight only (placebo/
  jackknife)" to "bootstrap only (PR #352) for strata/PSU/FPC";
  "Note on SyntheticDiD" block rewritten for the landed contract.
- ``diff_diff/synthetic_did.py`` ``__init__`` docstring: bootstrap
  bullet now describes survey support and the ω_eff composition.
- ``diff_diff/guides/llms-full.txt``: survey-aware bootstrap bullet
  includes SDID in the Rao-Wu list with the weighted-FW formula.

CHANGELOG.md:
- Retain the PR #351 regression Changed entry but annotate it as
  "restored in PR #352"; add new Added/Changed PR #352 entries
  documenting the weighted-FW kernel, survey helpers, _bootstrap_se
  Rao-Wu composition, and the new coverage MC row.

TODO.md:
- Row 103 (SDID + survey designs) → closed by PR #352; replaced
  with a narrower follow-up for placebo/jackknife + strata/PSU/FPC
  (Low priority, no concrete sketch yet).

Tests:
- ``TestCoverageMCArtifact`` extended: 4 DGPs asserted (including
  ``stratified_survey``); new explicit assertions that the
  stratified_survey bootstrap row has ≥100 successful fits and
  α=0.05 rejection ∈ [0.02, 0.10]; placebo/jackknife rows
  n_successful_fits == 0 (strata/PSU/FPC rejection contract).

Verified: TestCoverageMCArtifact passes against the regenerated
artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>